Lyrics: Refactor Genius, Google backends, and consolidate common functionality (#5474)

### Bug Fixes
- Fixed #4791: Resolved an issue with the Genius backend where it
couldn't match lyrics if there was a slight variation in the artist's
name.

### Plugin Enhancements
* **Session Management**: Introduced a `TimeoutSession` to enable
connection pooling and maintain consistent configuration across
requests.
* **Error Handling**: Centralized error handling logic in a new
`RequestsHandler` class, which includes methods for retrieving either
HTML text or JSON data.
* **Logging**: Added methods to ensure the backend name is included in
log messages.
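
As a rough illustration of the logging change, a backend base class could prefix every message with the name of the concrete backend. This is a minimal sketch using only the standard `logging` module; `BackendLogging`, `_format` and the stand-in `Genius` class are hypothetical names, not the plugin's actual code.

```python
import logging


class BackendLogging:
    """Hypothetical mixin that prefixes log messages with the backend name."""

    def __init__(self, log: logging.Logger) -> None:
        self._log = log

    def _format(self, message: str) -> str:
        # Use the concrete class name so subclasses are labelled automatically.
        return f"{type(self).__name__}: {message}"

    def debug(self, message: str, *args: object) -> None:
        self._log.debug(self._format(message), *args)

    def info(self, message: str, *args: object) -> None:
        self._log.info(self._format(message), *args)

    def warning(self, message: str, *args: object) -> None:
        self._log.warning(self._format(message), *args)


class Genius(BackendLogging):
    """Stand-in backend used only to demonstrate the prefix."""
```

With this shape, `Genius(logging.getLogger("lyrics")).debug("Fetching %s", url)` would log a message that begins with `Genius:` without each call site repeating the backend name.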

### Configuration Changes
* Added a new `dist_thresh` field to the configuration, allowing users
to control the maximum tolerable mismatch between the artist and title
of a lyrics search result and those of their item. This field already
existed, though undocumented, and was used in the `Tekstowo` backend.
The threshold now also applies to the **Genius** and **Google** search
logic.

### Backend Updates
* All backends that perform searches now validate each result against
the configured `dist_thresh`.
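
The validation described above might look roughly like the sketch below. It is not the plugin's implementation: `string_distance` is a stand-in built on `difflib.SequenceMatcher`, and taking the larger of the artist and title distances is an assumption made here for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional


def string_distance(a: str, b: str) -> float:
    """Return 0.0 for identical strings and 1.0 for entirely different ones."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closest_match(
    target_artist: str,
    target_title: str,
    candidates: list[tuple[str, str]],
    dist_thresh: float,
) -> Optional[tuple[str, str]]:
    """Return the candidate closest to the target, or None if none is close enough."""

    def distance(candidate: tuple[str, str]) -> float:
        artist, title = candidate
        # Assumption: take the worse of the two distances so both fields must match.
        return max(
            string_distance(target_artist, artist),
            string_distance(target_title, title),
        )

    # Rank all candidates by distance and keep the best one only if it
    # does not exceed the configured threshold.
    ranked = sorted(candidates, key=distance)
    if ranked and distance(ranked[0]) <= dist_thresh:
        return ranked[0]
    return None
```

Ranking before thresholding means a result with the right title but the wrong artist is pushed down the list or rejected, instead of being accepted because it happened to come first.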

#### Genius
* Removed the need to scrape HTML tags for lyrics; instead, lyrics are
now parsed from the JSON data embedded in the HTML. This change should
reduce our vulnerability to Genius' frequent alterations in their HTML
structure.
* Documented the structure of their search JSON data.
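
To illustrate the general idea of reading lyrics from JSON embedded in a page rather than from HTML tags, here is a minimal sketch. The script-tag marker and the `song` and `lyrics` keys are made up for the example and do not describe the real layout of Genius pages.

```python
import json
import re
from typing import Optional

# Hypothetical marker: assumes the page embeds its state in a JSON script tag.
EMBEDDED_JSON = re.compile(
    r'<script[^>]*type="application/json"[^>]*>(?P<data>.*?)</script>',
    re.DOTALL,
)


def lyrics_from_html(html: str) -> Optional[str]:
    """Return lyrics stored in the embedded JSON, or None if they are absent."""
    match = EMBEDDED_JSON.search(html)
    if not match:
        return None
    try:
        data = json.loads(match.group("data"))
    except json.JSONDecodeError:
        return None
    # The "song" and "lyrics" keys are invented for this sketch.
    song = data.get("song") if isinstance(data, dict) else None
    lyrics = song.get("lyrics") if isinstance(song, dict) else None
    return lyrics.strip() if isinstance(lyrics, str) else None
```

Because the lookup goes through named keys rather than tag structure, cosmetic changes to the page markup would not affect it as long as the embedded data keeps its shape.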

#### Google
* Typed the response data returned by the Google Custom Search API.
* Excluded certain pages under **https://letras.mus.br** that do not
contain lyrics.
* Excluded all results from MusiXmatch, as we cannot access their pages.
* Improved parsing of URL titles (used for matching item/lyrics
artist/title):
  - Handled results from long search queries where URL titles are
    truncated with an ellipsis.
  - Enhanced URL title cleanup logic.
  - Added functionality to determine (or rather, guess) not only the track
    title but also the artist from the URL title.
* Similar to #5406, search results are now compared to the original item
and sorted by distance. Results exceeding the configured `dist_thresh`
value are discarded. The previous functionality simply selected the
first result containing the track's title in its URL, which often led to
returning lyrics for the wrong artist, particularly for short track
titles.
* Since search results are now matched against the item before lyrics
are fetched, the redundant checks for valid lyrics and the credits
cleanup have been removed.
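
The URL title handling described above could be sketched as follows. The patterns and the `Artist - Title` layout are assumptions chosen for illustration; the plugin's actual cleanup rules are more involved and are not reproduced here.

```python
import re
from typing import Optional

# Illustrative patterns only; real page titles vary far more than this.
ELLIPSIS = re.compile(r"\s*(?:\.\.\.|…)\s*$")
SITE_SUFFIX = re.compile(r"\s*\|[^|]*$")
LYRICS_WORD = re.compile(r"\s+lyrics\s*$", re.IGNORECASE)


def is_truncated(url_title: str) -> bool:
    """Report whether the title was cut short with a trailing ellipsis."""
    return bool(ELLIPSIS.search(url_title))


def guess_artist_and_title(url_title: str) -> tuple[Optional[str], str]:
    """Guess (artist, title) from a page title shaped like 'Artist - Title'."""
    cleaned = ELLIPSIS.sub("", url_title)
    cleaned = SITE_SUFFIX.sub("", cleaned)
    cleaned = LYRICS_WORD.sub("", cleaned).strip()
    # Assumption: the artist comes first when a " - " separator is present.
    artist, sep, title = cleaned.partition(" - ")
    if sep and artist and title:
        return artist.strip(), title.strip()
    return None, cleaned
```

The function returns `None` for the artist when no separator is found, so a caller can fall back to comparing the title alone instead of trusting a bad guess.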

### HTML Cleanup
* Organized regex patterns into a new `Html` class.
* Adjusted patterns to ensure new lines between blocks of lyrics text
scraped from `letras.mus.br` and `musica.com`.
* Modified patterns to scrape missing lyrics text on `paroles.net` and
`lacoccinelle.net`. See the diff in `test/plugins/lyrics_page.py`.
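
Grouping the patterns in one class might look like the sketch below. `HtmlCleanup` and its three patterns are hypothetical and much simpler than the `Html` class added in this commit; they only show how paragraph boundaries can be turned into blank lines between blocks of lyrics.

```python
import re


class HtmlCleanup:
    """Hypothetical grouping of the regex patterns used to turn HTML into text."""

    # <br> variants become single newlines.
    LINE_BREAK = re.compile(r"<br\s*/?>\s*", re.IGNORECASE)
    # A paragraph boundary becomes a blank line between lyric blocks.
    PARAGRAPH = re.compile(r"\s*</p>\s*<p[^>]*>\s*", re.IGNORECASE)
    # Any remaining tag is dropped.
    TAG = re.compile(r"<[^>]+>")

    @classmethod
    def to_text(cls, html: str) -> str:
        """Convert a lyrics HTML fragment into plain text."""
        text = cls.PARAGRAPH.sub("\n\n", html)
        text = cls.LINE_BREAK.sub("\n", text)
        return cls.TAG.sub("", text).strip()
```

Keeping the patterns as class attributes compiles them once and gives each one a name that can be referenced in tests.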
snejus authored Jan 27, 2025
2 parents 80bc539 + dab9a0d commit a1c0ebd
Showing 7 changed files with 854 additions and 759 deletions.
115 changes: 115 additions & 0 deletions beetsplug/_typing.py
@@ -0,0 +1,115 @@
from __future__ import annotations

from typing import Any

from typing_extensions import NotRequired, TypedDict

JSONDict = dict[str, Any]


class LRCLibAPI:
    class Item(TypedDict):
        """Lyrics data item returned by the LRCLib API."""

        id: int
        name: str
        trackName: str
        artistName: str
        albumName: str
        duration: float | None
        instrumental: bool
        plainLyrics: str
        syncedLyrics: str | None


class GeniusAPI:
    """Genius API data types.

    This documents *only* the fields that are used in the plugin.
    :attr:`SearchResult` is an exception, since I thought some of the other
    fields might be useful in the future.
    """

    class DateComponents(TypedDict):
        year: int
        month: int
        day: int

    class Artist(TypedDict):
        api_path: str
        header_image_url: str
        id: int
        image_url: str
        is_meme_verified: bool
        is_verified: bool
        name: str
        url: str

    class Stats(TypedDict):
        unreviewed_annotations: int
        hot: bool

    class SearchResult(TypedDict):
        annotation_count: int
        api_path: str
        artist_names: str
        full_title: str
        header_image_thumbnail_url: str
        header_image_url: str
        id: int
        lyrics_owner_id: int
        lyrics_state: str
        path: str
        primary_artist_names: str
        pyongs_count: int | None
        relationships_index_url: str
        release_date_components: GeniusAPI.DateComponents
        release_date_for_display: str
        release_date_with_abbreviated_month_for_display: str
        song_art_image_thumbnail_url: str
        song_art_image_url: str
        stats: GeniusAPI.Stats
        title: str
        title_with_featured: str
        url: str
        featured_artists: list[GeniusAPI.Artist]
        primary_artist: GeniusAPI.Artist
        primary_artists: list[GeniusAPI.Artist]

    class SearchHit(TypedDict):
        result: GeniusAPI.SearchResult

    class SearchResponse(TypedDict):
        hits: list[GeniusAPI.SearchHit]

    class Search(TypedDict):
        response: GeniusAPI.SearchResponse


class GoogleCustomSearchAPI:
    class Response(TypedDict):
        """Search response from the Google Custom Search API.

        If the search returns no results, the :attr:`items` field is not found.
        """

        items: NotRequired[list[GoogleCustomSearchAPI.Item]]

    class Item(TypedDict):
        """A Google Custom Search API result item.

        :attr:`title` field is shown to the user in the search interface, thus
        it gets truncated with an ellipsis for longer queries. For most
        results, the full title is available as ``og:title`` metatag found
        under the :attr:`pagemap` field. Note neither this metatag nor the
        ``pagemap`` field is guaranteed to be present in the data.
        """

        title: str
        link: str
        pagemap: NotRequired[GoogleCustomSearchAPI.Pagemap]

    class Pagemap(TypedDict):
        """Pagemap data with a single meta tags dict in a list."""

        metatags: list[JSONDict]
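
The optional fields documented in `GoogleCustomSearchAPI` suggest how a consumer has to read the response defensively. The sketch below redeclares flattened copies of the three types so that it runs on its own; `full_title` and `titles_by_link` are hypothetical helpers, not functions from the plugin.

```python
from typing import Any

try:  # NotRequired is in the standard library from Python 3.11.
    from typing import NotRequired, TypedDict
except ImportError:  # Older interpreters need the typing_extensions backport.
    from typing_extensions import NotRequired, TypedDict


class Pagemap(TypedDict):
    metatags: list[dict[str, Any]]


class Item(TypedDict):
    title: str
    link: str
    pagemap: NotRequired[Pagemap]


class Response(TypedDict):
    items: NotRequired[list[Item]]


def full_title(item: Item) -> str:
    """Prefer the untruncated og:title metatag, falling back to the title."""
    metatags = item.get("pagemap", {}).get("metatags", [])
    if metatags and metatags[0].get("og:title"):
        return metatags[0]["og:title"]
    return item["title"]


def titles_by_link(response: Response) -> dict[str, str]:
    """Map each result link to its best available title."""
    # "items" is missing entirely when the search has no results.
    return {item["link"]: full_title(item) for item in response.get("items", [])}
```

Using `.get()` with a default for every `NotRequired` key keeps the empty-result case and the missing-`pagemap` case on the same code path as a complete response.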