Determine title from content if `<title>` is missing

When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we take the title from the page’s `<title>` element, or use the empty string if there is none: https://github.com/edgi-govdata-archiving/web-monitoring-processing/blob/9c6a2cfed53c32e886ae16ce287878beffbf9622/web_monitoring/utils.py#L169-L180

Quite a few pages turn out to have no `<title>` element, so it would probably be good to fall back to the `<h1>` or some other title-like information in the page body.

- Where present, the first `<h1>` seems like a reasonable fallback. Examples:
    - https://api.monitoring.envirodatagov.org/api/v0/versions/05cd42f1-d995-446e-a33f-04ee2f96b6ad?different=false
    - https://monitoring.envirodatagov.org/page/d1620a7d-557c-4517-89f7-53577d5d4e34/31ffa13d-b1ab-410b-b531-b8198db171bc..540fc862-3220-43b5-8a24-28f08a86554f

- [EPA’s LASSO](https://api.monitoring.envirodatagov.org/api/v0/versions/ab1019d4-c329-440a-89b8-f26b7332f90d?different=false) adds some complexity here. The title is in an `<h1>`, but another `<h1>` (the first one) is a link back to the EPA home page. Maybe look for the first `<h1>` that doesn’t contain a link to a different URL?

- Argonne National Laboratory has the same issue, with the first `<h1>` being a link to the home page: https://monitoring.envirodatagov.org/page/d617a0c4-27b7-4bad-a190-983e25cc1819/0230138d-361b-40aa-a65b-d610f5fbe3e5..d431a180-aa91-44fc-aec9-c8a3f2706aa9

- [“National Flood Hazard Layer (NFHL)”](https://api.monitoring.envirodatagov.org/api/v0/versions/8a0a336d-824e-46b0-aad0-a400b369ffee?different=false) has no heading elements at all (in the HTML; it does after scripts run, which is… not great). However, it does have: `<span class="title">National Flood Hazard Layer (NFHL)</span>`. So maybe looking for `//*[contains(concat(' ',normalize-space(@class),' '),' title ')]` is good?

- For plain text, maybe the first sentence of the first line? ([example](https://api.monitoring.envirodatagov.org/api/v0/versions/f49b1179-2da4-4762-b717-847027c4ceea?different=false))

- A lot of error pages have no title. Maybe use `<status code> <status text>` (e.g. “404 Not Found”) in this case?
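
The HTML heuristics in the list above could be sketched roughly as follows. This is only an illustration, not existing project code: `fallback_title`, its `page_url` parameter, and the helper names are hypothetical, and the ordering (first qualifying `<h1>`, then the first element with a `title` class) is an assumption about how the ideas might be prioritized.

```python
import io
from urllib.parse import urldefrag, urljoin

import lxml.html

# XPath from the NFHL example: any element whose class list includes "title".
TITLE_CLASS_XPATH = (
    "//*[contains(concat(' ',normalize-space(@class),' '),' title ')]"
)


def _clean(text):
    # Collapse runs of whitespace, roughly as a browser does when rendering.
    return ' '.join(text.split())


def _links_elsewhere(heading, page_url):
    """True if the heading contains a link to a URL other than the page's."""
    page = urldefrag(page_url).url.rstrip('/')
    for link in heading.iter('a'):
        href = link.get('href')
        if href is None:
            continue
        target = urldefrag(urljoin(page_url, href)).url.rstrip('/')
        if target != page:
            return True
    return False


def fallback_title(content_str, page_url):
    """Hypothetical fallback for pages that have no usable <title>."""
    try:
        root = lxml.html.parse(io.StringIO(content_str)).getroot()
    except Exception:
        return ''
    if root is None:
        return ''

    # 1. First <h1> with text that is not a link to some other page.
    for heading in root.iter('h1'):
        text = _clean(heading.text_content())
        if text and not _links_elsewhere(heading, page_url):
            return text

    # 2. First element with a "title" class that has any text.
    for element in root.xpath(TITLE_CLASS_XPATH):
        text = _clean(element.text_content())
        if text:
            return text

    return ''
```

A `class="title"` match is a loose signal and could pick up unrelated elements on some pages, so it sits last in this sketch.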

For reference, the existing `extract_title` linked above (excerpt):

	import io
	import lxml.html

	def extract_title(content_bytes, encoding='utf-8'):
	    "Return content of <title> tag as string. On failure return empty string."
	    content_str = content_bytes.decode(encoding=encoding, errors='ignore')
	    # The parser expects a file-like, so we mock one.
	    content_as_file = io.StringIO(content_str)
	    try:
	        title = lxml.html.parse(content_as_file).find(".//title")
	    except Exception:
	        return ''

	    if title is None or title.text is None:
	        return ''
	    # (excerpt ends here)
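
For responses that are not HTML, or that are error pages, the plain-text and status-code ideas from the list might look something like the sketch below. Both function names are hypothetical. Two assumptions are baked in: the "first line" is taken to be the first non-blank line, and the status text comes from Python's standard `http.HTTPStatus` phrases rather than from whatever reason phrase the server actually sent.

```python
import re
from http import HTTPStatus

# End of a sentence: ".", "!" or "?" followed by whitespace or end of line.
SENTENCE_END = re.compile(r'(?<=[.!?])(?:\s|$)')


def title_from_text(text):
    """First sentence of the first non-blank line of a plain-text body."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return SENTENCE_END.split(line, maxsplit=1)[0]
    return ''


def title_from_status(status_code):
    """'<status code> <status text>', e.g. '404 Not Found'."""
    try:
        return f'{status_code} {HTTPStatus(status_code).phrase}'
    except ValueError:
        # Non-standard code: fall back to the number alone.
        return str(status_code)
```

The sentence split is naive and will cut early at abbreviations such as "U.S.", so a real implementation would probably want a length cap or a smarter splitter.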

Determine title from content if `<title>` is missing #863
