Basic text processing #82

Closed
michelole opened this issue Jul 4, 2017 · 2 comments

Comments

@michelole
Member

Ensure that Elasticsearch performs basic text processing, such as:

  • Matching plural and singular forms
  • Removing parentheses
@michelole michelole added the P0 label Jul 4, 2017
@michelole michelole self-assigned this Jul 4, 2017
@michelole
Member Author

We're using the default standard analyzer. According to the documentation:

It splits the text on word boundaries, as defined by the Unicode Consortium, and removes most punctuation. Finally, it lowercases all terms.
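To make the quoted behaviour concrete, here is a rough approximation in plain Python. It is only a sketch: the real standard analyzer follows the Unicode text segmentation rules, and this regex does not (apostrophes, for example, are handled differently). The function name is made up for illustration.

```python
import re


def approx_standard_analyze(text: str) -> list[str]:
    """Crude stand-in for the standard analyzer: lowercase the text,
    then keep runs of word characters, which drops most punctuation."""
    return re.findall(r"\w+", text.lower())


# Parentheses are dropped, so that requirement is already covered.
print(approx_standard_analyze("Mutation (BRAF)"))

# Plural and singular forms stay distinct, so they will not match each other.
print(approx_standard_analyze("tumor"), approx_standard_analyze("tumors"))
```

Under this approximation, parentheses are already taken care of, while plural and singular forms still produce different terms.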

We could switch to the english analyzer, which performs stemming and stop word removal and would therefore also handle plurals.

We need to reindex to apply these changes.
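A minimal sketch of what the switch could look like, written as the JSON bodies for the two requests involved. The index names and the field name are hypothetical, since this thread does not mention them, and the mapping uses the typeless layout of Elasticsearch 7 and later; older versions need a mapping type name around `properties`. Nothing here talks to a cluster, it only builds and prints the bodies.

```python
import json

# Hypothetical names, not taken from the project.
SOURCE_INDEX = "docs_standard"
TARGET_INDEX = "docs_english"
TEXT_FIELD = "text"


def english_index_body(field: str) -> dict:
    """Body for `PUT /<target index>`: analyze one text field with the
    built-in english analyzer instead of the default standard analyzer."""
    return {
        "mappings": {
            "properties": {
                field: {"type": "text", "analyzer": "english"}
            }
        }
    }


def reindex_body(source: str, dest: str) -> dict:
    """Body for `POST /_reindex`: copy documents from the old index into
    the new one so they are analyzed again with the new mapping."""
    return {"source": {"index": source}, "dest": {"index": dest}}


if __name__ == "__main__":
    print(json.dumps(english_index_body(TEXT_FIELD), indent=2))
    print(json.dumps(reindex_body(SOURCE_INDEX, TARGET_INDEX), indent=2))
```

Creating a second index and copying into it keeps the original index around, which makes it easy to compare results between the two analyzers.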

michelole pushed a commit to michelole/trec2017 that referenced this issue Jul 29, 2017
@michelole
Member Author

English stemming worsened results from 0.7693 to 0.6884.

It could still benefit from #97, however.
