Use specific stemmer by dataset according to the language #1437

Open
AndreaFrancis opened this issue Jun 27, 2023 · 15 comments

@AndreaFrancis
Contributor

Currently, the 'porter' stemmer is used by default for DuckDB indexing: https://github.com/huggingface/datasets-server/pull/1296/files#diff-d9a2c828d7feca3b7f9e332e040ef861e842a16d18276b356461d2aa34396a8aR145
See https://duckdb.org/docs/extensions/full_text_search.html for more details about the 'stemmer' parameter.
In the future, we could try to identify the dataset language and pass an appropriate stemmer when creating the FTS index.
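
For illustration, here is a minimal sketch of how the stemmer could be passed through when building the index statement. The helper name `build_create_fts_index_sql` and the table and column names (`data`, `row_id`, `text`) are hypothetical and not taken from the datasets-server code. Only the `PRAGMA create_fts_index` form with its `stemmer` and `overwrite` parameters comes from the DuckDB docs linked above.

```python
def build_create_fts_index_sql(table, id_column, text_columns, stemmer="porter"):
    """Build a DuckDB `PRAGMA create_fts_index` statement for a given stemmer.

    Hypothetical helper: it only assembles the SQL string and does not
    execute anything against DuckDB.
    """
    if not text_columns:
        raise ValueError("at least one text column is required")

    def quote(value):
        # Render a SQL string literal, doubling any embedded single quotes.
        return "'" + value.replace("'", "''") + "'"

    # Positional arguments: table name, id column, then the indexed columns.
    args = [quote(table), quote(id_column)] + [quote(c) for c in text_columns]
    return (
        f"PRAGMA create_fts_index({', '.join(args)}, "
        f"stemmer={quote(stemmer)}, overwrite=1);"
    )


sql = build_create_fts_index_sql("data", "row_id", ["text"], stemmer="russian")
print(sql)
```

The resulting string could then be run through a DuckDB connection once the `fts` extension is loaded. Keeping 'porter' as the default preserves the current behaviour when no language is known.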

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@severo
Collaborator

severo commented Aug 7, 2023

The P2 label should have prevented the stale bot from closing the issue. Fixed with #1635.

@severo severo reopened this Aug 7, 2023
@severo
Collaborator

severo commented Aug 15, 2023

@AndreaFrancis
Contributor Author

I tried the Arabic and Russian stemmers as described in the DuckDB docs, but I wasn't able to run even a simple query using FTS. I opened an issue here: duckdb/duckdb#10254

@AndreaFrancis
Contributor Author

duckdb/duckdb#10254 has been fixed, but I think we will need to solve #1914 and find a way to not break search when updating the DuckDB version.

@severo
Collaborator

severo commented Jan 19, 2024

good reaction time from the duckdb team!

@AndreaFrancis
Contributor Author

The https://pypi.org/project/duckdb/0.9.3.dev2934/ pre-release appears to have fixed FTS for non-ASCII characters. Is this a version we can use now, or should we wait for an official release?

@severo
Collaborator

severo commented Jan 19, 2024

let's try, I would say

@severo severo added P1 Not as needed as P0, but still important/wanted and removed P2 Nice to have labels Feb 6, 2024
@AndreaFrancis
Contributor Author

Do we still need to work on this? I have seen that the porter stemmer works for other languages like Arabic and Russian.

@AndreaFrancis
Contributor Author

#2928 will add a specific stemmer for a dataset only if it is marked as monolingual (that is, only one language for all splits). But there are some caveats:

  • DuckDB only supports 26 stemmer languages (see https://duckdb.org/docs/extensions/full_text_search.html#pragma-create_fts_index).
  • What if a dataset supports more than one language (assuming we get the language from HfApi - card data - language)? Which of these languages should we use for a given split? Even if the config name has the language name, we could try to infer the language of the split using tools like langdetect or fastText.
    In the same split, there could be different columns in different languages. I was thinking of creating one index per language (or maybe per column) and, in the end, combining all the results for the search criteria, or trying to evaluate embeddings with multilingual models. (Maybe this was the idea for multilingual datasets, @lhoestq?)
    Any comments? @huggingface/dataset-viewer
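
As a rough sketch of the monolingual rule described above (not the actual code in #2928): pick a language-specific stemmer only when the card data lists exactly one language and that language has a DuckDB stemmer, and otherwise fall back to 'porter'. The function name `pick_stemmer` and the mapping are assumptions. The mapping is a small illustrative subset keyed by ISO 639-1 codes, not the full list of supported stemmers.

```python
DEFAULT_STEMMER = "porter"

# Illustrative subset only (assumption): language code -> DuckDB stemmer name.
LANGUAGE_TO_STEMMER = {
    "ar": "arabic",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fr": "french",
    "ru": "russian",
}


def pick_stemmer(card_languages):
    """Return a language-specific stemmer only for monolingual datasets.

    `card_languages` mimics the `language` field of a dataset card, which
    may be missing (None), a single string, or a list of strings.
    """
    if isinstance(card_languages, str):
        card_languages = [card_languages]
    languages = {lang.lower() for lang in (card_languages or [])}
    # Zero or several languages: keep the default stemmer.
    if len(languages) != 1:
        return DEFAULT_STEMMER
    (language,) = languages
    # A single language without a dedicated stemmer also falls back.
    return LANGUAGE_TO_STEMMER.get(language, DEFAULT_STEMMER)


print(pick_stemmer(["ru"]))        # monolingual, supported language
print(pick_stemmer(["en", "fr"]))  # multilingual, falls back to the default
```

Multilingual datasets deliberately get the default here. Per-split detection or one index per language would be a separate step.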

@lhoestq
Member

lhoestq commented Jun 21, 2024

Starting with monolingual datasets sounds like the best idea, since as you explained it can be quite complex to handle multilingual ones. The list of 26 is a good start already, and we can surely fall back on the porter stemmer.

For multilingual datasets, ideally DuckDB could allow using multiple stemmers somehow? Let's see with them, I guess.

@polinaeterna
Contributor

To extend the list of supported languages: I came across the idea of using proxy tokenizers for languages that don't have dedicated ones. For those, one can use the tokenizer of the closest related language (if they share the same writing system).
For example, there is a list in the datatrove lib: https://github.com/huggingface/datatrove/blob/898efc0fc6ee2050f8ef78f7236cace2b26f2824/src/datatrove/utils/word_tokenizers.py#L297 (I'm questioning some choices there :D but the idea is nice).

Maybe we can use the same idea for stemmers?
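
A minimal sketch of what a proxy lookup for stemmers could look like, assuming two small hand-written tables. The language pairings below are hypothetical examples chosen for illustration. They are not taken from the datatrove list and would need review by someone who knows the languages.

```python
# Languages with a dedicated DuckDB stemmer (illustrative subset, assumption).
DIRECT_STEMMERS = {"nl": "dutch", "pt": "portuguese"}

# Hypothetical proxy pairs: language without a stemmer -> related language
# that shares the same writing system and does have one.
PROXY_LANGUAGES = {"af": "nl", "gl": "pt"}


def resolve_stemmer(language, default="porter"):
    """Resolve a stemmer directly, then via a proxy language, then default."""
    if language in DIRECT_STEMMERS:
        return DIRECT_STEMMERS[language]
    proxy = PROXY_LANGUAGES.get(language)
    # `proxy` is None when no pair is defined, so the lookup yields `default`.
    return DIRECT_STEMMERS.get(proxy, default)


print(resolve_stemmer("gl"))  # resolved through the proxy table
print(resolve_stemmer("zz"))  # unknown language, default stemmer
```

Whether a proxy stemmer actually improves search over 'porter' for a given language would have to be checked on real data.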

@severo severo added P2 Nice to have and removed P1 Not as needed as P0, but still important/wanted labels Aug 22, 2024
@severo
Collaborator

severo commented Aug 22, 2024

maybe close this issue since it's basically done now with #2928, and open new issues if we want to support multilingual datasets, have more than 26 stemmers, or other improvements as proposed by Polina?
