feat: Add Bookwyrm scraping support #1003
Conversation
Bookwyrm doesn't provide any API access as far as I know.
Hello @luxaeternati, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
This pull request introduces scraping support for Bookwyrm editions. It identifies Bookwyrm instances using nodeinfo and scrapes book item pages. Since Bookwyrm lacks API access, web scraping is necessary. The PR also includes a script to facilitate migration from Bookwyrm, leveraging the POST /api/me/shelf/item/{item_uuid} API endpoint after items are added.
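As an illustration of the nodeinfo check mentioned above, here is a minimal standalone sketch. It is not code from this PR (the PR uses the project's `CachedDownloader`), and the helper name is hypothetical; the point is that a nodeinfo 2.0 document exposes a `software.name` field, which Bookwyrm instances report as `bookwyrm`.

```python
# Hypothetical sketch of the nodeinfo probe (not code from this PR).
# Bookwyrm instances answer /nodeinfo/2.0 with {"software": {"name": "bookwyrm", ...}}.
import requests


def looks_like_bookwyrm(hostname: str) -> bool:
    try:
        nodeinfo = requests.get(f"https://{hostname}/nodeinfo/2.0", timeout=10).json()
    except (requests.RequestException, ValueError):
        return False
    return nodeinfo.get("software", {}).get("name") == "bookwyrm"


print(looks_like_bookwyrm("bookwyrm.social"))  # expected: True for a Bookwyrm instance
```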
Highlights
- Feature: Bookwyrm Scraping: Adds the ability to scrape book data from Bookwyrm instances, which do not offer a public API.
- Nodeinfo Identification: Uses nodeinfo to identify Bookwyrm instances before attempting to scrape.
- Migration Script: Provides a Python script to help migrate data from Bookwyrm using the API.
Changelog
- catalog/common/models.py
  - Added `Bookwyrm` to the `SiteName` enum.
  - Added `Bookwyrm` to the `IdType` enum.
- catalog/sites/__init__.py
  - Imported the `Bookwyrm` class from `catalog/sites/bookwyrm.py`.
- catalog/sites/bookwyrm.py
  - Created a new `Bookwyrm` class to handle scraping of Bookwyrm book pages.
  - Implemented `validate_url_fallback` to verify whether a URL is a Bookwyrm book page using nodeinfo.
  - Implemented the `scrape` method to extract book details such as title, author, ISBN, publication date, publisher, cover image, page count, description, subtitle, and series information.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
From Bookwyrm's page, data we glean,
No API, a scraping scene.
With Python's grace,
We find our place,
To build our catalog, serene.
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces scraping support for Bookwyrm editions, which is a valuable addition given the lack of a public API. The implementation appears well-structured, utilizing nodeinfo to identify Bookwyrm instances and scraping specific book item pages. The provided migration script is also a helpful resource. However, there are a few areas that could be improved for robustness and maintainability.
Summary of Findings
- Error Handling in `validate_url_fallback`: The `validate_url_fallback` method relies on `CachedDownloader`, which can raise exceptions if the nodeinfo endpoint is unavailable or returns unexpected data. This could lead to incorrect validation results. Consider adding error handling to gracefully manage these scenarios.
- Robustness of Scraping Logic: The `scrape` method uses multiple `xpath` queries to extract data. If the structure of the Bookwyrm page changes, these queries could fail, leading to incomplete or incorrect data. Implement more robust error handling and consider using more specific `xpath` queries or CSS selectors to minimize the impact of page structure changes.
- Missing Unit Tests: The pull request lacks unit tests for the new `Bookwyrm` site implementation. Unit tests are crucial for ensuring the correctness and stability of the scraping logic and validation methods. Add unit tests to cover various scenarios, including successful scraping, handling of missing data, and validation of different URL formats.
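On the missing-tests point, a rough sketch of what a unit test for `validate_url_fallback` might look like, assuming `CachedDownloader` is imported inside `catalog/sites/bookwyrm.py` (as the diffs below suggest) and that the method returns a boolean; the test names and URLs are illustrative, not taken from this PR.

```python
# Hypothetical test sketch: mock the nodeinfo download so no network is needed.
from unittest import mock

from django.test import TestCase

from catalog.sites.bookwyrm import Bookwyrm


class BookwyrmSiteTestCase(TestCase):
    @mock.patch("catalog.sites.bookwyrm.CachedDownloader")
    def test_accepts_bookwyrm_book_url(self, downloader):
        # Simulate a nodeinfo response identifying the host as Bookwyrm.
        downloader.return_value.download.return_value.json.return_value = {
            "software": {"name": "bookwyrm"}
        }
        self.assertTrue(
            Bookwyrm.validate_url_fallback("https://bookwyrm.social/book/1234")
        )

    @mock.patch("catalog.sites.bookwyrm.CachedDownloader")
    def test_rejects_non_bookwyrm_host(self, downloader):
        downloader.return_value.download.return_value.json.return_value = {
            "software": {"name": "mastodon"}
        }
        self.assertFalse(
            Bookwyrm.validate_url_fallback("https://example.org/book/1234")
        )
```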
Merge Readiness
While the core functionality appears to be implemented correctly, the lack of error handling and unit tests raises concerns about the long-term maintainability and reliability of the code. I recommend addressing the identified issues before merging this pull request. I am unable to approve this pull request, and users should have others review and approve this code before merging.
catalog/sites/bookwyrm.py (Outdated)

```python
probe_url = "https://" + parsed.hostname + "/nodeinfo/2.0"  # type: ignore
software = (
    CachedDownloader(probe_url).download().json().get("software").get("name")
)
```
This code block could raise exceptions if the `probe_url` is unreachable or if the JSON response doesn't contain the expected structure. Consider adding a `try...except` block to handle potential `RequestException` or `KeyError` exceptions and return `False` in case of failure.
```python
try:
    software = (
        CachedDownloader(probe_url).download().json().get("software").get("name")
    )
except (requests.RequestException, KeyError):
    return False
```

catalog/sites/bookwyrm.py

```python
    map(str, tree.xpath("//meta[contains(@itemprop,'publisher')]/@content"))  # type: ignore
).strip()

cover_src = tree.xpath("//img[contains(@class,'book-cover')]/@src")[0]  # type: ignore
```
This line assumes that `cover_src` is always present in the `xpath` result. If the element is not found, this will raise an `IndexError`. Consider adding a check to ensure that the list is not empty before accessing the first element.
```python
cover_src = tree.xpath("//img[contains(@class,'book-cover')]/@src")
cover_src = cover_src[0] if cover_src else None  # type: ignore
```

catalog/sites/bookwyrm.py

```python
if len(pub_date) == 3:
    data["pub_year"] = pub_date[0]
    data["pub_month"] = pub_date[1]
```
It's good to check the length of `pub_date`, but consider adding a check to ensure that the elements at index 0 and 1 exist before accessing them. This will prevent an `IndexError` if the `pub_date` list has fewer than 2 elements.
```python
if len(pub_date) >= 2:
    data["pub_year"] = pub_date[0]
    data["pub_month"] = pub_date[1]
```
I have previously looked at Bookwyrm. If you request the URL with an `Accept: application/json` header, you will get the content as JSON, which might be easier to handle. My idea would have been to make it part of the fediverse scraper, looking at the slight differences in the JSON output between Bookwyrm and NeoDB, but I didn't get around to finishing the code.
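As a rough illustration of that suggestion (the book URL and the `title` field are placeholders; the JSON returned is reportedly Bookwyrm's serialized representation of the edition):

```python
# Sketch only: fetch a Bookwyrm book page as JSON via content negotiation
# instead of parsing HTML. URL and "title" field are illustrative.
import requests

resp = requests.get(
    "https://bookwyrm.social/book/1234",  # hypothetical book URL
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()
edition = resp.json()
print(edition.get("title"))
```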
Yes, JSON serialization is happening for objects inheriting, if I remember correctly, the …
settle with a fixed id that doesn't vary according to bookwyrm redirection
Thanks for the PR! I'm sure we'll federate with Bookwyrm one day, but a link to their URL might be a nice first step (#725). I'm open to merging this if you can:
bonus points (or save for the future if too complicated):
It's a bit of an ask, but I hope we can do it right, especially for another Fediverse service.
catalog/sites/bookwyrm.py

```python
@classmethod
def validate_url_fallback(cls, url: str):
    parsed = urlparse(url)
    probe_url = "https://" + parsed.hostname + "/nodeinfo/2.0"  # type: ignore
```
It would be good if the result of checking the host could be cached.
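One way to do that, as a hedged sketch only (the helper name and the `lru_cache` choice are assumptions, not part of this PR): memoize the per-host verdict so that validating many URLs from the same instance reuses a single nodeinfo probe.

```python
# Hypothetical sketch: cache the per-host nodeinfo check. Assumes
# CachedDownloader is available in this module, as in the PR's diff.
from functools import lru_cache


@lru_cache(maxsize=1024)
def _is_bookwyrm_host(hostname: str) -> bool:
    probe_url = "https://" + hostname + "/nodeinfo/2.0"
    try:
        software = (
            CachedDownloader(probe_url).download().json().get("software", {}).get("name")
        )
    except Exception:  # treat any probe failure as "not Bookwyrm"
        return False
    return software == "bookwyrm"
```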
Add scraping support for Bookwyrm editions.
Use nodeinfo to identify a Bookwyrm instance; then, if the URL matches the book item page type, scrape it.
As far as I know, Bookwyrm doesn't offer API access, so scraping the web page is needed.
This also enables migration from Bookwyrm: after the items are all added, the API endpoint `POST /api/me/shelf/item/{item_uuid}` can be used for migration. I attach the following script for the convenience of others:
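The attached script is not shown above. As a hypothetical sketch only, the marking step it describes could look roughly like this; the instance URL, token handling, and payload fields such as `shelf_type` and `visibility` are assumptions, not details taken from the script:

```python
# Hypothetical migration sketch (NOT the script attached to the PR).
# Marks an already-imported item on the user's shelf via the endpoint above.
import requests

NEODB_URL = "https://neodb.example"  # placeholder: your instance
TOKEN = "your-api-token"             # placeholder: API token with write access


def mark_item(item_uuid: str, shelf_type: str = "complete") -> None:
    resp = requests.post(
        f"{NEODB_URL}/api/me/shelf/item/{item_uuid}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"shelf_type": shelf_type, "visibility": 0},  # field names assumed
        timeout=10,
    )
    resp.raise_for_status()


# Example usage: mark_item("<item_uuid>")
```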