
Conversation

@luxaeternati

Add scraping support for Bookwyrm editions.

Use nodeinfo to identify a Bookwyrm instance; then, if the URL matches the book item page pattern, scrape it.

As far as I know, Bookwyrm doesn't offer API access, so scraping the web page is needed.
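
For context, the instance check boils down to probing the host's standard nodeinfo endpoint and comparing the reported software name. Below is a standalone sketch using requests (the PR itself does this with CachedDownloader inside validate_url_fallback):

from urllib.parse import urlparse
import requests

def is_bookwyrm_instance(url: str) -> bool:
    # Probe /nodeinfo/2.0 and check whether the instance reports "bookwyrm".
    host = urlparse(url).hostname
    if not host:
        return False
    try:
        info = requests.get("https://" + host + "/nodeinfo/2.0", timeout=10).json()
    except (requests.RequestException, ValueError):
        return False
    return info.get("software", {}).get("name") == "bookwyrm"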


This also enables migration from Bookwyrm. After the items are all added, the API endpoint POST /api/me/shelf/item/{item_uuid} can be used for migration. I attach the following script for the convenience of others:

import csv, json
import requests
import time

# NeoDB API token and host names; fill these in before running.
API_KEY = ""

HEADERS = {"Authorization": "Bearer " + API_KEY}
NEODB_HOST = ""
BOOKWYRM_HOST = ""

api_base = "https://" + NEODB_HOST + "/api/catalog/search"
fetch_base = "https://" + NEODB_HOST + "/api/catalog/fetch?url="
book_base = "https://" + BOOKWYRM_HOST

# Input: the CSV export from Bookwyrm. Output: three JSONL files splitting the
# rows into books already in the NeoDB catalog, books that still need to be
# fetched, and books without an ISBN.
csvFile = open('bookwyrm-export.csv', 'r')
positiveFile = open('existing.jsonl', 'w')
negativeFile = open('nonexistent.jsonl', 'w')
noIsbnFile = open('noisbn.jsonl', 'w')
fieldNames = (
    "title",
    "author_text",
    "remote_id",
    "openlibrary_key",
    "inventaire_id",
    "librarything_key",
    "goodreads_key",
    "bnf_id",
    "viaf",
    "wikidata",
    "asin",
    "aasin",
    "isfdb",
    "isbn_10",
    "isbn_13",
    "oclc_number",
    "start_date",
    "finish_date",
    "stopped_date",
    "rating",
    "review_name",
    "review_cw",
    "review_content",
    "review_published",
    "shelf",
    "shelf_name",
    "shelf_date"
)
# Field names are passed explicitly, so DictReader also yields the header row;
# it is skipped below via the row counter.
reader = csv.DictReader(csvFile, fieldNames)

c = 0
def toJson(row):
    # Keep only the fields needed for matching and fetching.
    relevant = {
        "title": row.get("title"),
        "author": list(map(str.strip, row.get("author_text").split(','))),
        "isbn_10": row.get("isbn_10"),
        "isbn_13": row.get("isbn_13"),
        "remote_id": row.get("remote_id"),
    }
    # Prefer ISBN-13 and fall back to ISBN-10.
    query_string = relevant.get("isbn_13") if relevant.get("isbn_13") != '' else relevant.get("isbn_10")
    if query_string == '':
        print("no isbn")
        noIsbnFile.write(json.dumps(relevant))
        noIsbnFile.write('\n')
        return
    print("query_string: " + query_string)
    # Search the NeoDB catalog by ISBN to see whether the book already exists.
    results_count = requests.get(
        api_base + "?query=" + query_string + "&category=book").json().get("count")
    if results_count == 0:
        print("not found: " + relevant.get("title") + " by " + ",".join(relevant.get("author")))
        negativeFile.write(json.dumps(relevant))
        negativeFile.write('\n')
    else:
        print("found: " + relevant.get("title") + " by " + ",".join(relevant.get("author")))
        positiveFile.write(json.dumps(relevant))
        positiveFile.write('\n')

for row in reader:
    if c != 0:  # skip the first row, which is the CSV header
        toJson(row)
    c += 1
    print(c)


csvFile.close()
positiveFile.close()
negativeFile.close()

# Ask NeoDB to fetch each missing book from its Bookwyrm URL. The fetch endpoint
# works asynchronously: poll until the response contains the new item's url.
negativeFile = open('nonexistent.jsonl', 'r')
json_list = list(map(json.loads, negativeFile))
i = 1
t = len(json_list)
for item in json_list:
    link = item.get("remote_id")
    print("fetching item [" + str(i) + "/" + str(t) + "]: " + item.get("title") + " ...")
    fetching = requests.get(fetch_base + link, headers=HEADERS)
    while not fetching.json().get("url"):
        print("\t waiting 5 seconds")
        time.sleep(5)
        print("\t refetch")
        fetching = requests.get(fetch_base + link, headers=HEADERS)
    print("fetched: " + item.get("title") + " - " + "https://" + NEODB_HOST + fetching.json().get("url"))
    i += 1

print("Job finished.")


@gemini-code-assist gemini-code-assist bot left a comment


Hello @luxaeternati, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request introduces scraping support for Bookwyrm editions. It identifies Bookwyrm instances using nodeinfo and scrapes book item pages. Since Bookwyrm lacks API access, web scraping is necessary. The PR also includes a script to facilitate migration from Bookwyrm, leveraging the POST /api/me/shelf/item/{item_uuid} API endpoint after items are added.

Highlights

  • Feature: Bookwyrm Scraping: Adds the ability to scrape book data from Bookwyrm instances, which do not offer a public API.
  • Nodeinfo Identification: Uses nodeinfo to identify Bookwyrm instances before attempting to scrape.
  • Migration Script: Provides a Python script to help migrate data from Bookwyrm using the API.

Changelog

  • catalog/common/models.py
    • Added Bookwyrm to the SiteName enum.
    • Added Bookwyrm to the IdType enum.
  • catalog/sites/__init__.py
    • Imported the Bookwyrm class from catalog/sites/bookwyrm.py.
  • catalog/sites/bookwyrm.py
    • Created a new Bookwyrm class to handle scraping of Bookwyrm book pages.
    • Implemented validate_url_fallback to verify if a URL is a Bookwyrm book page using nodeinfo.
    • Implemented the scrape method to extract book details such as title, author, ISBN, publication date, publisher, cover image, page count, description, subtitle, and series information.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


From Bookwyrm's page, data we glean,
No API, a scraping scene.
With Python's grace,
We find our place,
To build our catalog, serene.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces scraping support for Bookwyrm editions, which is a valuable addition given the lack of a public API. The implementation appears well-structured, utilizing nodeinfo to identify Bookwyrm instances and scraping specific book item pages. The provided migration script is also a helpful resource. However, there are a few areas that could be improved for robustness and maintainability.

Summary of Findings

  • Error Handling in validate_url_fallback: The validate_url_fallback method relies on CachedDownloader which can raise exceptions if the nodeinfo endpoint is unavailable or returns unexpected data. This could lead to incorrect validation results. Consider adding error handling to gracefully manage these scenarios.
  • Robustness of Scraping Logic: The scrape method uses multiple xpath queries to extract data. If the structure of the Bookwyrm page changes, these queries could fail, leading to incomplete or incorrect data. Implement more robust error handling and consider using more specific xpath queries or CSS selectors to minimize the impact of page structure changes.
  • Missing Unit Tests: The pull request lacks unit tests for the new Bookwyrm site implementation. Unit tests are crucial for ensuring the correctness and stability of the scraping logic and validation methods. Add unit tests to cover various scenarios, including successful scraping, handling of missing data, and validation of different URL formats (a hedged test sketch follows this list).
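
For the missing-tests point, here is a minimal sketch of what such a test could look like. The import path catalog.sites.bookwyrm.Bookwyrm and the mock target are assumptions drawn from the changelog above, and the project's own test conventions may differ:

from unittest.mock import patch

from catalog.sites.bookwyrm import Bookwyrm

@patch("catalog.sites.bookwyrm.CachedDownloader")
def test_validate_url_fallback_accepts_book_pages(mock_downloader):
    # Pretend the instance's nodeinfo reports "bookwyrm" as the software name.
    mock_downloader.return_value.download.return_value.json.return_value = {
        "software": {"name": "bookwyrm"}
    }
    assert Bookwyrm.validate_url_fallback("https://bookwyrm.social/book/272989")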

Merge Readiness

While the core functionality appears to be implemented correctly, the lack of error handling and unit tests raises concerns about the long-term maintainability and reliability of the code. I recommend addressing the identified issues before merging this pull request. I am unable to approve this pull request, and users should have others review and approve this code before merging.

Comment on lines 29 to 32
probe_url = "https://" + parsed.hostname + "/nodeinfo/2.0"  # type: ignore
software = (
    CachedDownloader(probe_url).download().json().get("software").get("name")
)


medium

This code block could raise exceptions if the probe_url is unreachable or if the JSON response doesn't contain the expected structure. Consider adding a try...except block to handle potential RequestException or KeyError exceptions and return False in case of failure.

try:
    software = (
        CachedDownloader(probe_url).download().json().get("software").get("name")
    )
except (requests.RequestException, KeyError):
    return False

map(str, tree.xpath("//meta[contains(@itemprop,'publisher')]/@content")) # type: ignore
).strip()

cover_src = tree.xpath("//img[contains(@class,'book-cover')]/@src")[0] # type: ignore


medium

This line assumes that the cover_src is always present in the xpath result. If the element is not found, cover_src will raise an IndexError. Consider adding a check to ensure that the list is not empty before accessing the first element.

cover_src = tree.xpath("//img[contains(@class,'book-cover')]/@src")
cover_src = cover_src[0] if cover_src else None  # type: ignore

Comment on lines +100 to +102
if len(pub_date) == 3:
    data["pub_year"] = pub_date[0]
    data["pub_month"] = pub_date[1]


medium

It's good to check the length of pub_date, but consider adding a check to ensure that the elements at index 0 and 1 exist before accessing them. This will prevent IndexError if the pub_date list has fewer than 2 elements.

if len(pub_date) >= 2:
    data["pub_year"] = pub_date[0]
    data["pub_month"] = pub_date[1]

@pmakholm

I have previously looked at Bookwyrm.

If you request the URL with an "Accept: application/json" header, you will get the content as json. This might be easier to handle.

My idea would have been to make it part of the fediverse scraper, looking at the slight differences between the JSON output of Bookwyrm and NeoDB.

But I didn't get around to finishing the code.
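
For illustration, that alternative looks roughly like this (example edition URL adapted from the reply below):

import requests

# Requesting a Bookwyrm page with an Accept header returns JSON instead of HTML.
resp = requests.get(
    "https://bookwyrm.social/book/272989",
    headers={"Accept": "application/json"},
    timeout=10,
)
edition = resp.json()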

@luxaeternati
Author

> I have previously looked at Bookwyrm.
>
> If you request the URL with an "Accept: application/json" header, you will get the content as json. This might be easier to handle.
>
> My idea would have been to make it part of the fediverse scraper, looking at the slight differences between the JSON output of Bookwyrm and NeoDB.
>
> But I didn't get around to finishing the code.

Yes, JSON serialization is available for objects inheriting (if I remember correctly) the ActivityObject class or something similar, so we have, for example, https://bookwyrm.social/book/272989.json.
But it is actually not significantly easier, since:

  • the author field is a list of URLs to the authors, not a list of names. Each author name needs to be fetched from its corresponding JSON URL; in the example above, we would need to get https://bookwyrm.social/author/42094.json and extract the name from it (see the sketch below);
  • the id needs to be extracted from the URL.

I guess it is a kind of trade-off.
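
A sketch of those extra hops, following the example URLs above; the "authors", "name", and "id" field names are assumptions based on the ActivityPub-style serialization and should be double-checked:

import requests

book = requests.get("https://bookwyrm.social/book/272989.json", timeout=10).json()

# The author field holds URLs, so each name requires another request.
author_names = []
for author_url in book.get("authors", []):
    author = requests.get(author_url + ".json", timeout=10).json()  # .json suffix as in the example
    author_names.append(author.get("name"))

# The id has to be pulled out of the object's URL.
book_id = book.get("id", "").rstrip("/").split("/")[-1]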

settle with a fixed id that doesn't vary according to bookwyrm redirection
@alphatownsman
Member

Thanks for the PR! I'm sure we'll federate with Bookwyrm one day, but a link to their URL might be a nice first step #725

I'm open to merging this if you can:

  • confirm Bookwyrm dev is aware and no objection
  • use API instead of web scraping if possible (NeoDB does plan to add people support, so a bit of extra effort parsing those will go a long way)
  • add tests; pass test and lint check

bonus points (or save for future if too complicated):

  • work
  • remote search
  • import

It's a bit of an ask, but I hope we can do it right, especially for another Fediverse service.

@classmethod
def validate_url_fallback(cls, url: str):
    parsed = urlparse(url)
    probe_url = "https://" + parsed.hostname + "/nodeinfo/2.0"  # type: ignore
Member


It would be good if the result of the host check could be cached.
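
One possible shape for that (illustrative only: lru_cache is per-process, and CachedDownloader is referenced here as in the snippet above):

from functools import lru_cache

@lru_cache(maxsize=256)
def _host_is_bookwyrm(hostname: str) -> bool:
    # Probe nodeinfo once per host and remember the answer.
    probe_url = "https://" + hostname + "/nodeinfo/2.0"
    try:
        software = CachedDownloader(probe_url).download().json().get("software", {}).get("name")
    except Exception:
        return False
    return software == "bookwyrm"

validate_url_fallback would then call _host_is_bookwyrm(parsed.hostname) before checking the book page pattern.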
