Use specific stemmer by dataset according to the language #1437

AndreaFrancis · 2023-06-27T13:46:44Z

Currently, the 'porter' stemmer is used by default for DuckDB indexing; see https://github.com/huggingface/datasets-server/pull/1296/files#diff-d9a2c828d7feca3b7f9e332e040ef861e842a16d18276b356461d2aa34396a8aR145
See https://duckdb.org/docs/extensions/full_text_search.html for more details about the 'stemmer' parameter.
In the future, we could try to identify the dataset language and pass an appropriate 'stemmer' value when creating the FTS index.
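
For illustration, here is a minimal sketch of how the 'stemmer' parameter could be threaded into the index-creation statement. It only builds the SQL string for DuckDB's `PRAGMA create_fts_index`; the helper name, the table name `data` and the column names `row_id` and `text` are assumptions made for this example, not the names used in the linked PR.

```python
def build_create_fts_index_sql(table, id_column, text_columns, stemmer="porter"):
    """Build a PRAGMA create_fts_index statement with an explicit stemmer.

    Hypothetical helper: table and column names are supplied by the caller.
    """

    def quote(value):
        # Single-quote a SQL string literal, doubling embedded quotes.
        return "'" + value.replace("'", "''") + "'"

    args = [quote(table), quote(id_column)]
    args += [quote(column) for column in text_columns]
    # 'stemmer' and 'overwrite' are named parameters of create_fts_index.
    args.append(f"stemmer={quote(stemmer)}")
    args.append("overwrite=1")
    return f"PRAGMA create_fts_index({', '.join(args)});"


# Example with hypothetical names: index the 'text' column of table 'data'.
sql = build_create_fts_index_sql("data", "row_id", ["text"], stemmer="french")
```

The resulting string can be passed to a DuckDB connection's `execute` once the `fts` extension is installed and loaded.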

github-actions · 2023-07-27T15:04:15Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

severo · 2023-08-07T16:35:13Z

The P2 label should have prevented the stale bot from closing the issue. Fixed with #1635.

severo · 2023-08-15T22:54:21Z

https://datasets-server.huggingface.co/search?dataset=HeshamHaroon%2FQA_Arabic&config=HeshamHaroon--QA_Arabic&split=train&query=%D9%85%D9%86&offset=0&limit=100 returns no results, although it should (query=من)
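
To reproduce this kind of request from a script, the URL can be rebuilt from its parts. This is a small sketch using only the standard library; the endpoint and parameter names are taken from the URL above, while the helper name `build_search_url` is an assumption for this example.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://datasets-server.huggingface.co/search"


def build_search_url(dataset, config, split, query, offset=0, limit=100):
    """Build a /search URL; urlencode percent-encodes non-ASCII text as UTF-8."""
    params = {
        "dataset": dataset,
        "config": config,
        "split": split,
        "query": query,
        "offset": offset,
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url(
    "HeshamHaroon/QA_Arabic", "HeshamHaroon--QA_Arabic", "train", "من"
)
```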

severo · 2023-08-15T22:54:51Z

Same issue with https://datasets-server.huggingface.co/search?dataset=satpalsr%2FindicCorpv2&config=pa&split=train&query=%E0%A8%A1%E0%A8%B0%E0%A8%BE%E0%A8%85%E0%A8%B0%E0%A8%BE%E0%A8%82&offset=0&limit=100 (query=ਡਰਾਅਰਾਂ)

severo · 2023-08-15T22:57:23Z

Works well on French: https://datasets-server.huggingface.co/search?dataset=allocine&config=allocine&split=train&query=quotidien&offset=0&limit=100

AndreaFrancis · 2024-01-17T12:56:10Z

I tried the Arabic and Russian stemmers as described in the DuckDB docs, but I wasn't able to perform a simple query using FTS. I opened an issue: duckdb/duckdb#10254

AndreaFrancis · 2024-01-18T19:33:35Z

duckdb/duckdb#10254 has been fixed, but I think we will need to solve #1914 and find a way to not break search when updating the DuckDB version.

severo · 2024-01-19T10:12:25Z

good reaction time from the duckdb team!

AndreaFrancis · 2024-01-19T12:11:39Z

The https://pypi.org/project/duckdb/0.9.3.dev2934/ pre-release appears to have fixed FTS for non-ASCII characters. Is this a version we can use now, or should we wait for an official release?

severo · 2024-01-19T12:25:06Z

let's try, I would say

AndreaFrancis · 2024-03-14T14:16:40Z

Do we still need to work on this? I have seen that the porter stemmer works for other languages such as Arabic and Russian.

AndreaFrancis · 2024-06-20T20:40:45Z

#2928 will add a specific stemmer for a dataset only if it is marked as monolingual (that is, only one language for all splits). But there are some caveats:

- DuckDB only supports 26 stemmer languages (see https://duckdb.org/docs/extensions/full_text_search.html#pragma-create_fts_index).
- What if a dataset supports more than one language (assuming we get the language from HfApi - card data - language)? Which of these languages should we use for the split? Even if the config name contains the language name, we could try to infer the language of the split using tools like langdetect or fastText.
- In the same split, there could be different columns for different languages. I was thinking of creating one index per language (or maybe per column) and, in the end, combining all the results for the search criteria, or trying to evaluate embeddings with multilingual models. (Maybe this was the idea for multilingual datasets, @lhoestq?)

Any comments? @huggingface/dataset-viewer
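
As a rough sketch of the monolingual rule described in this comment (not the code in #2928): pick a dedicated stemmer only when the card data declares exactly one language that DuckDB supports, and otherwise keep 'porter'. The function name, the language-code mapping and the fallback behaviour are assumptions for this example; the stemmer names are the ones recalled from the DuckDB docs and should be checked against the installed version.

```python
# Assumed mapping from ISO 639-1 codes to DuckDB stemmer names (26 languages).
# Verify the names against the DuckDB full_text_search documentation.
DUCKDB_STEMMER_BY_LANGUAGE = {
    "ar": "arabic", "eu": "basque", "ca": "catalan", "da": "danish",
    "nl": "dutch", "en": "english", "fi": "finnish", "fr": "french",
    "de": "german", "el": "greek", "hi": "hindi", "hu": "hungarian",
    "id": "indonesian", "ga": "irish", "it": "italian", "lt": "lithuanian",
    "ne": "nepali", "no": "norwegian", "pt": "portuguese", "ro": "romanian",
    "ru": "russian", "sr": "serbian", "es": "spanish", "sv": "swedish",
    "ta": "tamil", "tr": "turkish",
}
DEFAULT_STEMMER = "porter"


def choose_stemmer(card_languages):
    """Return a stemmer name for a dataset given its card 'language' field.

    card_languages may be None, a single code, or a list of codes.
    """
    if not card_languages:
        return DEFAULT_STEMMER
    if isinstance(card_languages, str):
        card_languages = [card_languages]
    unique = {language.lower() for language in card_languages}
    if len(unique) != 1:
        # Multilingual dataset: out of scope here, keep the default.
        return DEFAULT_STEMMER
    (language,) = unique
    # Unsupported language: fall back to the default stemmer.
    return DUCKDB_STEMMER_BY_LANGUAGE.get(language, DEFAULT_STEMMER)
```

With this rule, a dataset tagged only `fr` would be indexed with 'french', while one tagged `en` and `fr`, or tagged `pa`, would keep 'porter'.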

lhoestq · 2024-06-21T16:19:31Z

Starting with monolingual datasets sounds like the best idea since, as you explained, handling multilingual datasets can be quite complex. The list of 26 is already a good start, and we can surely fall back on the porter stemmer.

For multilingual datasets, ideally DuckDB could allow using multiple stemmers somehow? Let's see with them, I guess.

polinaeterna · 2024-07-04T10:13:53Z

To extend the list of supported languages: I came across the idea of using proxy tokenizers for languages that don't have dedicated tokenizers. For those, one can use the tokenizer of the closest related language (if it has the same writing system).
For example, there is a list in the datatrove library: https://github.com/huggingface/datatrove/blob/898efc0fc6ee2050f8ef78f7236cace2b26f2824/src/datatrove/utils/word_tokenizers.py#L297 (I'm questioning some choices there :D but the idea is nice).

Maybe we can use the same idea for stemmers?
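
A sketch of what a proxy lookup for stemmers could look like. The proxy pairs below are illustrative assumptions only (they are not taken from datatrove and have not been evaluated for search quality), and the function and table names are hypothetical.

```python
# Languages with a dedicated stemmer (small excerpt, for illustration).
DIRECT_STEMMERS = {"nl": "dutch", "no": "norwegian", "pt": "portuguese"}

# Hypothetical proxies: related languages that share a writing system.
PROXY_STEMMERS = {
    "af": "dutch",       # Afrikaans -> Dutch (assumption)
    "gl": "portuguese",  # Galician -> Portuguese (assumption)
    "nb": "norwegian",   # Norwegian Bokmal -> Norwegian (assumption)
    "nn": "norwegian",   # Norwegian Nynorsk -> Norwegian (assumption)
}


def resolve_stemmer(language, default="porter"):
    """Prefer a dedicated stemmer, then a proxy, then the default."""
    code = language.lower()
    if code in DIRECT_STEMMERS:
        return DIRECT_STEMMERS[code]
    return PROXY_STEMMERS.get(code, default)
```

Keeping the proxy table separate from the dedicated one makes it easy to review or disable individual proxy choices.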

severo · 2024-08-22T00:45:06Z

maybe close this issue since it's basically done now with #2928, and open new issues if we want to support multilingual datasets, have more than 26 stemmers, or other improvements as proposed by Polina?

severo added improvement / optimization P2 Nice to have labels Jul 27, 2023

github-actions bot closed this as completed Aug 5, 2023

severo reopened this Aug 7, 2023

severo added P1 Not as needed as P0, but still important/wanted and removed P2 Nice to have labels Feb 6, 2024

severo mentioned this issue Jun 20, 2024

FTS: Add specific stemmer for monolingual datasets #2928

Merged

severo added P2 Nice to have and removed P1 Not as needed as P0, but still important/wanted labels Aug 22, 2024

AndreaFrancis mentioned this issue Nov 18, 2024

search function and stopwords #3104

Open
