Description
When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we look for the page’s <title> element or use the empty string:
web-monitoring-processing/web_monitoring/utils.py, lines 169 to 180 in 9c6a2cf
There are a bunch of pages that turn out to be missing <title> elements, so it would probably be good to fall back to looking for the <h1> or some other title-like information in the page body.
- Where present, the first <h1> seems like a reasonable fallback. Examples:
  - EPA’s LASSO adds some complexity here. The title is in an <h1>, but another <h1> (the first one) is a link back to the EPA home page. Maybe look for an <h1> that doesn’t contain a link to a different URL?
  - Argonne Nat’l Labs has similar issues with the first <h1> being a link to the home page: https://monitoring.envirodatagov.org/page/d617a0c4-27b7-4bad-a190-983e25cc1819/0230138d-361b-40aa-a65b-d610f5fbe3e5..d431a180-aa91-44fc-aec9-c8a3f2706aa9
- “National Flood Hazard Layer (NFHL)” has no heading elements at all (in the HTML; it does after scripts run, which is… not great). However, it does have: <span class="title">National Flood Hazard Layer (NFHL)</span>. So maybe looking for //*[contains(concat(' ',normalize-space(@class),' '),' title ')] is good?
- For plain text, maybe the first sentence of the first line? (example)
- A lot of error pages have no title. Maybe use <status code> <status text> (e.g. “404 Not Found”) in this case?
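To make the ideas above concrete, here is a rough sketch of how the fallbacks could chain together. It is not the existing code in web_monitoring/utils.py: the names guess_title and _TitleCandidates and the parameters content_type, url, and status are all hypothetical. It uses the standard library's html.parser instead of an XPath query, treats "first line" as the first non-empty line, and does not model implied end tags (e.g. an unclosed <p>), so treat it as a starting point for discussion only.

```python
import re
from html.parser import HTMLParser
from http import HTTPStatus
from urllib.parse import urldefrag, urljoin


class _TitleCandidates(HTMLParser):
    """Collect text of <title>, <h1>, and class="title" elements (hypothetical helper)."""

    def __init__(self, page_url=''):
        super().__init__(convert_charrefs=True)
        self.page_url = urldefrag(page_url)[0]
        self.found = {'title': [], 'h1': [], 'class': []}
        self._open = []  # candidate elements that have not been closed yet

    def _last_open(self, tag):
        # Innermost open candidate with this tag name, or None.
        return next((e for e in reversed(self._open) if e['tag'] == tag), None)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('href'):
            target = urldefrag(urljoin(self.page_url, attrs['href']))[0]
            if target != self.page_url:
                # Every open candidate now contains a link to another URL.
                for entry in self._open:
                    entry['links_away'] = True
        kind = None
        if tag in ('title', 'h1'):
            kind = tag
        elif 'title' in (attrs.get('class') or '').split():
            kind = 'class'  # same intent as the XPath class test in the issue
        if kind:
            self._open.append({'kind': kind, 'tag': tag, 'parts': [],
                               'depth': 0, 'links_away': False})
        else:
            # Count nested same-name elements so their end tags are not
            # mistaken for the candidate's own end tag.
            entry = self._last_open(tag)
            if entry:
                entry['depth'] += 1

    def handle_data(self, data):
        for entry in self._open:
            entry['parts'].append(data)

    def handle_endtag(self, tag):
        entry = self._last_open(tag)
        if entry is None:
            return
        if entry['depth']:
            entry['depth'] -= 1
            return
        self._open.remove(entry)
        # Skip <h1>s that link elsewhere (the EPA and Argonne cases).
        if not (entry['kind'] == 'h1' and entry['links_away']):
            text = ' '.join(''.join(entry['parts']).split())
            self.found[entry['kind']].append(text)


def guess_title(body, content_type='text/html', url='', status=200):
    """Best-effort title: <title>, then <h1>, then class="title", then status."""
    if 'html' in content_type:
        parser = _TitleCandidates(url)
        parser.feed(body)
        parser.close()
        for kind in ('title', 'h1', 'class'):
            for text in parser.found[kind]:
                if text:
                    return text
    elif content_type.startswith('text/plain'):
        # Assumption: use the first non-empty line, cut at the first sentence end.
        line = next((l.strip() for l in body.splitlines() if l.strip()), '')
        sentence = re.split(r'(?<=[.!?])\s', line, maxsplit=1)[0]
        if sentence:
            return sentence
    if status >= 400:
        try:
            return f'{status} {HTTPStatus(status).phrase}'
        except ValueError:
            pass  # non-standard status code; fall through to empty string
    return ''
```

The order here (title, then a non-linking <h1>, then class="title", then the status line) is just one plausible priority; whether an error page's <h1> should win over "404 Not Found" is an open question.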