Use specific stemmer by dataset according to the language #1437
Comments
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
The P2 label should have prevented the stale bot from closing the issue. Fixed with #1635.
https://datasets-server.huggingface.co/search?dataset=HeshamHaroon%2FQA_Arabic&config=HeshamHaroon--QA_Arabic&split=train&query=%D9%85%D9%86&offset=0&limit=100 returns no result, while it should (query=من)
I tried the Arabic and Russian stemmers as in the DuckDB docs, but I wasn't able to perform a simple query using FTS. I posted an issue here: duckdb/duckdb#10254
duckdb/duckdb#10254 has been fixed, but I think we will need to solve #1914 and find a way to not break search when updating the duckdb version.
good reaction time from the duckdb team!
The https://pypi.org/project/duckdb/0.9.3.dev2934/ pre-release looks to have fixed FTS for non-ASCII characters. Is this a version we can currently use, or should we wait for an official release?
let's try, I would say
#2928 will add a specific stemmer for a dataset only if it is marked as monolingual (that is, only one language for all splits). But there are some caveats as:
Starting with the monolingual datasets sounds like the best idea, since, as you explained, it can be quite complex to handle multilingual datasets. The list of 26 is a good start already, and we can surely fall back on the porter stemmer. For multilingual datasets, ideally duckdb could allow using multiple stemmers somehow? Let's see with them, I guess
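A minimal sketch of the selection rule discussed above: a language-specific stemmer only when every split shares one language, and porter otherwise. The function name, the language codes, and the four-entry mapping are assumptions made for illustration; the real list should come from the stemmers that the installed DuckDB FTS extension accepts.

```python
# Illustrative subset only: these codes and stemmer names are assumptions and
# should be checked against the stemmers the DuckDB FTS extension supports.
STEMMER_BY_LANGUAGE = {
    "ar": "arabic",
    "en": "english",
    "fr": "french",
    "ru": "russian",
}
DEFAULT_STEMMER = "porter"


def pick_stemmer(split_languages):
    """Return a language-specific stemmer only for monolingual datasets.

    `split_languages` holds one language code per split (hypothetical input).
    """
    # Ignore splits with no declared language, then count distinct languages.
    distinct = {lang.lower() for lang in split_languages if lang}
    if len(distinct) != 1:
        # Unknown or multilingual dataset: keep the current default.
        return DEFAULT_STEMMER
    (language,) = distinct
    # Monolingual but unsupported languages also fall back to the default.
    return STEMMER_BY_LANGUAGE.get(language, DEFAULT_STEMMER)
```

With this rule, a dataset whose splits are all tagged "ar" would get the arabic stemmer, while a dataset mixing "ar" and "en" would stay on porter.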
To extend the list of supported languages: I've come across the idea of using proxy tokenizers for languages that don't have dedicated tokenizers. For those, one can use the tokenizer of the closest related language (if it has the same writing system). Maybe we can use the same idea for stemmers?
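The proxy idea could look roughly like the sketch below. Everything here is hypothetical: the helper name, the tiny set of supported stemmers, and the single proxy pairing are placeholders to show the lookup order, not a vetted linguistic mapping.

```python
# Illustrative subset of stemmer names (assumption, not the full list).
SUPPORTED_STEMMERS = {"portuguese", "spanish", "russian"}

# Hypothetical pairing: language without a dedicated stemmer -> related
# language that has one and shares the same writing system.
PROXY_LANGUAGE = {"galician": "portuguese"}


def resolve_stemmer(language, default="porter"):
    """Pick a dedicated stemmer, then a proxy stemmer, then the default."""
    name = language.lower()
    if name in SUPPORTED_STEMMERS:
        return name
    proxy = PROXY_LANGUAGE.get(name)
    if proxy in SUPPORTED_STEMMERS:
        return proxy
    return default
```

Whether a proxy stemmer actually improves search quality for a given language pair would need to be checked on real queries before adding the pair to the mapping.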
maybe close this issue since it's basically done now with #2928, and open new issues if we want to support multilingual datasets, have more than 26 stemmers, or other improvements as proposed by Polina?
Currently, the `porter` stemmer is used by default for duckdb indexing here: https://github.com/huggingface/datasets-server/pull/1296/files#diff-d9a2c828d7feca3b7f9e332e040ef861e842a16d18276b356461d2aa34396a8aR145
See https://duckdb.org/docs/extensions/full_text_search.html for more details about the `stemmer` parameter.
In the future, we could try to identify the dataset language and use an appropriate stemmer parameter when creating the `fts` index.
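As a rough sketch of what passing a stemmer at index creation could look like, the helper below only builds the `PRAGMA create_fts_index` statement described in the DuckDB full-text search docs. The helper name and the table and column names are hypothetical, and this is not the code used in datasets-server.

```python
def build_fts_pragma(table, id_column, text_columns, stemmer="porter"):
    """Build a DuckDB `PRAGMA create_fts_index` statement as a string.

    Names are interpolated as-is, so they must come from trusted code,
    never from user input.
    """
    # The pragma takes the indexed column names as quoted string arguments.
    columns = ", ".join(f"'{column}'" for column in text_columns)
    return (
        f"PRAGMA create_fts_index('{table}', '{id_column}', {columns}, "
        f"stemmer='{stemmer}', overwrite=1);"
    )


# Hypothetical table and column names, for illustration only.
statement = build_fts_pragma("data", "id", ["question", "answer"], stemmer="arabic")
```

The resulting string could then be passed to a DuckDB connection's `execute` once the `fts` extension has been installed and loaded.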