
Conversation

@luxaeternati

Add scraping support for Bookwyrm editions.

Use nodeinfo to identify a Bookwyrm instance; then, if the URL matches the book item page pattern, scrape it.

As far as I know, Bookwyrm doesn't offer API access, so scraping the web page is needed.
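
For context, the instance check boils down to probing the host's standard nodeinfo endpoint and comparing the reported software name. Below is a standalone sketch using requests (the PR itself does this with CachedDownloader inside validate_url_fallback):

from urllib.parse import urlparse
import requests

def is_bookwyrm_instance(url: str) -> bool:
    # Probe /nodeinfo/2.0 and check whether the instance reports "bookwyrm".
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        info = requests.get("https://" + host + "/nodeinfo/2.0", timeout=10).json()
    except (requests.RequestException, ValueError):
        return False
    return info.get("software", {}).get("name") == "bookwyrm"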


This also enables migration from Bookwyrm. After the items are all added, the API endpoint POST /api/me/shelf/item/{item_uuid} can be used for migration. I attach the following script for the convenience of others:

import csv, json
import requests
import time

# NeoDB API token and host names; fill these in before running.
API_KEY = ""

HEADERS = {"Authorization": "Bearer " + API_KEY}
NEODB_HOST = ""
BOOKWYRM_HOST = ""

api_base = "https://" + NEODB_HOST + "/api/catalog/search"
fetch_base = "https://" + NEODB_HOST + "/api/catalog/fetch?url="
book_base = "https://" + BOOKWYRM_HOST

# Input: the CSV export from Bookwyrm. Output: three JSONL files splitting the
# rows into books already in the NeoDB catalog, books that still need to be
# fetched, and books without an ISBN.
csvFile = open('bookwyrm-export.csv', 'r')
positiveFile = open('existing.jsonl', 'w')
negativeFile = open('nonexistent.jsonl', 'w')
noIsbnFile = open('noisbn.jsonl', 'w')
fieldNames = (
    "title",
    "author_text",
    "remote_id",
    "openlibrary_key",
    "inventaire_id",
    "librarything_key",
    "goodreads_key",
    "bnf_id",
    "viaf",
    "wikidata",
    "asin",
    "aasin",
    "isfdb",
    "isbn_10",
    "isbn_13",
    "oclc_number",
    "start_date",
    "finish_date",
    "stopped_date",
    "rating",
    "review_name",
    "review_cw",
    "review_content",
    "review_published",
    "shelf",
    "shelf_name",
    "shelf_date"
)
# Field names are passed explicitly, so DictReader also yields the header row;
# it is skipped below via the row counter.
reader = csv.DictReader(csvFile, fieldNames)

c = 0
def toJson(row):
    # Keep only the fields needed for matching and fetching.
    relevant = {
        "title": row.get("title"),
        "author": list(map(str.strip, row.get("author_text").split(','))),
        "isbn_10": row.get("isbn_10"),
        "isbn_13": row.get("isbn_13"),
        "remote_id": row.get("remote_id"),
    }
    # Prefer ISBN-13 and fall back to ISBN-10.
    query_string = relevant.get("isbn_13") if relevant.get("isbn_13") != '' else relevant.get("isbn_10")
    if query_string == '':
        print("no isbn")
        noIsbnFile.write(json.dumps(relevant))
        noIsbnFile.write('\n')
        return
    print("query_string: " + query_string)
    # Search the NeoDB catalog by ISBN to see whether the book already exists.
    results_count = requests.get(
        api_base + "?query=" + query_string + "&category=book").json().get("count")
    if results_count == 0:
        print("not found: " + relevant.get("title") + " by " + ",".join(relevant.get("author")))
        negativeFile.write(json.dumps(relevant))
        negativeFile.write('\n')
    else:
        print("found: " + relevant.get("title") + " by " + ",".join(relevant.get("author")))
        positiveFile.write(json.dumps(relevant))
        positiveFile.write('\n')

for row in reader:
    if c != 0:  # skip the first row, which is the CSV header
        toJson(row)
    c += 1
    print(c)


csvFile.close()
positiveFile.close()
negativeFile.close()

# Ask NeoDB to fetch each missing book from its Bookwyrm URL. The fetch endpoint
# works asynchronously: poll until the response contains the new item's url.
negativeFile = open('nonexistent.jsonl', 'r')
json_list = list(map(json.loads, negativeFile))
i = 1
t = len(json_list)
for item in json_list:
    link = item.get("remote_id")
    print("fetching item [" + str(i) + "/" + str(t) + "]: " + item.get("title") + " ...")
    fetching = requests.get(fetch_base + link, headers=HEADERS)
    while not fetching.json().get("url"):
        print("\t waiting 5 seconds")
        time.sleep(5)
        print("\t refetch")
        fetching = requests.get(fetch_base + link, headers=HEADERS)
    print("fetched: " + item.get("title") + " - " + "https://" + NEODB_HOST + fetching.json().get("url"))
    i += 1

print("Job finished.")


@gemini-code-assist gemini-code-assist bot left a comment


Hello @luxaeternati, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces scraping support for Bookwyrm editions. It identifies Bookwyrm instances using nodeinfo and scrapes book item pages. Since Bookwyrm lacks API access, web scraping is necessary. The PR also includes a script to facilitate migration from Bookwyrm, leveraging the POST /api/me/shelf/item/{item_uuid} API endpoint after items are added.

Highlights

  • Feature: Bookwyrm Scraping: Adds the ability to scrape book data from Bookwyrm instances, which do not offer a public API.
  • Nodeinfo Identification: Uses nodeinfo to identify Bookwyrm instances before attempting to scrape.
  • Migration Script: Provides a Python script to help migrate data from Bookwyrm using the API.

Changelog

  • catalog/common/models.py
    • Added Bookwyrm to the SiteName enum.
    • Added Bookwyrm to the IdType enum.
  • catalog/sites/__init__.py
    • Imported the Bookwyrm class from catalog/sites/bookwyrm.py.
  • catalog/sites/bookwyrm.py
    • Created a new Bookwyrm class to handle scraping of Bookwyrm book pages.
    • Implemented validate_url_fallback to verify if a URL is a Bookwyrm book page using nodeinfo.
    • Implemented the scrape method to extract book details such as title, author, ISBN, publication date, publisher, cover image, page count, description, subtitle, and series information.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


From Bookwyrm's page, data we glean,
No API, a scraping scene.
With Python's grace,
We find our place,
To build our catalog, serene.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces scraping support for Bookwyrm editions, which is a valuable addition given the lack of a public API. The implementation appears well-structured, utilizing nodeinfo to identify Bookwyrm instances and scraping specific book item pages. The provided migration script is also a helpful resource. However, there are a few areas that could be improved for robustness and maintainability.

Summary of Findings

  • Error Handling in validate_url_fallback: The validate_url_fallback method relies on CachedDownloader which can raise exceptions if the nodeinfo endpoint is unavailable or returns unexpected data. This could lead to incorrect validation results. Consider adding error handling to gracefully manage these scenarios.
  • Robustness of Scraping Logic: The scrape method uses multiple xpath queries to extract data. If the structure of the Bookwyrm page changes, these queries could fail, leading to incomplete or incorrect data. Implement more robust error handling and consider using more specific xpath queries or CSS selectors to minimize the impact of page structure changes.
  • Missing Unit Tests: The pull request lacks unit tests for the new Bookwyrm site implementation. Unit tests are crucial for ensuring the correctness and stability of the scraping logic and validation methods. Add unit tests to cover various scenarios, including successful scraping, handling of missing data, and validation of different URL formats (a hedged test sketch follows this list).
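
For the missing-tests point, here is a minimal sketch of what such a test could look like. The import path catalog.sites.bookwyrm.Bookwyrm and the mock target are assumptions drawn from the changelog above, and the project's own test conventions may differ:

from unittest.mock import patch

from catalog.sites.bookwyrm import Bookwyrm

@patch("catalog.sites.bookwyrm.CachedDownloader")
def test_validate_url_fallback_accepts_book_pages(mock_downloader):
    # Pretend the instance's nodeinfo reports "bookwyrm" as the software name.
    mock_downloader.return_value.download.return_value.json.return_value = {
        "software": {"name": "bookwyrm"}
    }
    assert Bookwyrm.validate_url_fallback("https://bookwyrm.social/book/272989")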

Merge Readiness

While the core functionality appears to be implemented correctly, the lack of error handling and unit tests raises concerns about the long-term maintainability and reliability of the code. I recommend addressing the identified issues before merging this pull request. I am unable to approve this pull request, and users should have others review and approve this code before merging.

Comment on lines 29 to 32
probe_url = "https://" + parsed.hostname + "/nodeinfo/2.0"  # type: ignore
software = (
    CachedDownloader(probe_url).download().json().get("software").get("name")
)


medium

This code block could raise exceptions if the probe_url is unreachable or if the JSON response doesn't contain the expected structure. Consider adding a try...except block to handle potential RequestException or KeyError exceptions and return False in case of failure.

try:
    software = (
        CachedDownloader(probe_url).download().json().get("software").get("name")
    )
except (requests.RequestException, KeyError):
    return False

map(str, tree.xpath("//meta[contains(@itemprop,'publisher')]/@content")) # type: ignore
).strip()

cover_src = tree.xpath("//img[contains(@class,'book-cover')]/@src")[0] # type: ignore


medium

This line assumes that the cover_src is always present in the xpath result. If the element is not found, cover_src will raise an IndexError. Consider adding a check to ensure that the list is not empty before accessing the first element.

cover_src = tree.xpath("//img[contains(@class,'book-cover')]/@src")
cover_src = cover_src[0] if cover_src else None  # type: ignore

Comment on lines +100 to +102
if len(pub_date) == 3:
    data["pub_year"] = pub_date[0]
    data["pub_month"] = pub_date[1]


medium

It's good to check the length of pub_date, but consider adding a check to ensure that the elements at index 0 and 1 exist before accessing them. This will prevent IndexError if the pub_date list has fewer than 2 elements.

if len(pub_date) >= 2:
    data["pub_year"] = pub_date[0]
    data["pub_month"] = pub_date[1]

@pmakholm

I have previously looked at Bookwyrm.

If you request the URL with an "Accept: application/json" header, you will get the content as json. This might be easier to handle.

My idea would have been to make it part of the fediverse scraper, looking at the slight differences between the JSON output of Bookwyrm and NeoDB.

But I didn't get around to finishing the code.
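
For illustration, that alternative looks roughly like this (example edition URL adapted from the reply below):

import requests

# Requesting a Bookwyrm page with an Accept header returns JSON instead of HTML.
resp = requests.get(
    "https://bookwyrm.social/book/272989",
    headers={"Accept": "application/json"},
    timeout=10,
)
edition = resp.json()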

@luxaeternati
Author

> I have previously looked at Bookwyrm.
>
> If you request the URL with an "Accept: application/json" header, you will get the content as json. This might be easier to handle.
>
> My idea would have been to make it part of the fediverse scraper, looking at the slight differences between the JSON output of Bookwyrm and NeoDB.
>
> But I didn't get around to finishing the code.

Yes, JSON serialization is available for objects inheriting (if I remember correctly) the ActivityObject class or something similar, so we have, for example, https://bookwyrm.social/book/272989.json.
But it is actually not significantly easier, since:

  • the author field is a list of URLs to the authors, not a list of names. Each author name needs to be fetched from its corresponding JSON URL; in the example above, we would need to get https://bookwyrm.social/author/42094.json and extract the name from it (see the sketch below);
  • the id needs to be extracted from the URL.

I guess it is a kind of trade-off.
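
A sketch of those extra hops, following the example URLs above; the "authors", "name", and "id" field names are assumptions based on the ActivityPub-style serialization and should be double-checked:

import requests

book = requests.get("https://bookwyrm.social/book/272989.json", timeout=10).json()

# The author field holds URLs, so each name requires another request.
author_names = []
for author_url in book.get("authors", []):
    author = requests.get(author_url + ".json", timeout=10).json()  # .json suffix as in the example
    author_names.append(author.get("name"))

# The id has to be pulled out of the object's URL.
book_id = book.get("id", "").rstrip("/").split("/")[-1]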

settle with a fixed id that doesn't vary according to bookwyrm redirection
@alphatownsman
Member

Thanks for the PR! I'm sure we'll federate with Bookwyrm one day, but a link to their URL might be a nice first step #725

I'm open to merging this if you can:

  • confirm Bookwyrm dev is aware and no objection
  • use API instead of web scraping if possible (NeoDB does plan to add people support, so a bit of extra effort parsing those will go a long way)
  • add tests; pass test and lint check

bonus points (or save for future if too complicated):

  • work
  • remote search
  • import

It's a bit of an ask, but I hope we can do it right, especially for another Fediverse service.

@classmethod
def validate_url_fallback(cls, url: str):
    parsed = urlparse(url)
    probe_url = "https://" + parsed.hostname + "/nodeinfo/2.0"  # type: ignore
Member


It would be good if the result of the host check could be cached.
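
One possible shape for that (illustrative only: lru_cache is per-process, and CachedDownloader is referenced here as in the snippet above):

from functools import lru_cache

@lru_cache(maxsize=256)
def _host_is_bookwyrm(hostname: str) -> bool:
    # Probe nodeinfo once per host and remember the answer.
    probe_url = "https://" + hostname + "/nodeinfo/2.0"
    try:
        software = CachedDownloader(probe_url).download().json().get("software", {}).get("name")
    except Exception:
        return False
    return software == "bookwyrm"

validate_url_fallback would then call _host_is_bookwyrm(parsed.hostname) before checking the book page pattern.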
