
Replace nokogiri with jsoup #155

Draft: wants to merge 3 commits into base: main

navarone-feekery (Collaborator) commented Oct 28, 2024

Summary

Nokogiri's JRuby version does not support HTML5. Fortunately, JRuby has access to Java libraries, so we can get this support by using jsoup.

jsoup and Nokogiri offer nearly identical functionality for HTML extraction, so in most places it was enough to switch to the equivalent jsoup method name.

However, Nokogiri has an agnostic search method that accepts either a CSS or an XPath selector, whereas jsoup has a separate method for each case. Because of this, some refactoring of the extraction rule selector logic was necessary.
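
As a rough illustration of that split (a minimal sketch, not the code in this PR), the dispatch could look like the following. The helper name `select_nodes` and the `type` argument are hypothetical; the only jsoup calls assumed are `Element#select` for CSS queries and `Element#selectXpath` for XPath queries.

```ruby
# Hypothetical helper: route a selector to the matching jsoup method.
# `doc` is assumed to be an org.jsoup.nodes.Document, or any object that
# responds to `select` (CSS) and `selectXpath` (XPath).
def select_nodes(doc, selector, type)
  case type
  when :css
    doc.select(selector)       # jsoup: Element#select(String cssQuery)
  when :xpath
    doc.selectXpath(selector)  # jsoup: Element#selectXpath(String xpath)
  else
    raise ArgumentError, "unknown selector type: #{type.inspect}"
  end
end
```

Under JRuby, with the jsoup jar on the classpath, `doc` would typically come from `org.jsoup.Jsoup.parse(html)`. Passing the selector type explicitly is a design assumption here; how the crawler actually decides between CSS and XPath is not described in this summary.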

Changes

  • Replace nokogiri with jsoup
  • Split extraction rule selector logic between XPath and CSS selectors
  • Remove nokogiri from the Gemfile
  • Bump jsoup to latest (1.18.1)

Release Note

Open Crawler now supports crawling HTML5 web content.

navarone-feekery changed the title from "Navarone/add jsoup" to "Replace nokogiri with jsoup" on Oct 28, 2024