When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we look for the page’s <title> element and fall back to the empty string if there isn’t one.
There are a bunch of pages that turn out to be missing <title> elements, so it would probably be good to fall back to looking for the <h1> or some other title-like information in the page body.
Where present, the first <h1> seems like a reasonable fallback. Examples:
EPA’s LASSO adds some complexity here. The title is in an <h1>, but another <h1> (the first one) is a link back to the EPA home page. Maybe look for the first <h1> that doesn’t contain a link to a different URL?
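A rough sketch of that <h1> rule, to make the idea concrete. It uses only the standard library's `html.parser` rather than whatever parser the importer actually uses, and the names `HeadingTitleParser` and `heading_title` are made up for illustration. It assumes "a different URL" means "resolves to something other than the page's own URL, ignoring the fragment".

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HeadingTitleParser(HTMLParser):
    """Collect the text of each <h1> and the hrefs of any links inside it."""

    def __init__(self):
        super().__init__()
        self.headings = []  # list of (text, [hrefs]) in document order
        self._depth = 0     # > 0 while inside an <h1>
        self._text = []
        self._hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h1':
            if self._depth == 0:
                self._text, self._hrefs = [], []
            self._depth += 1
        elif tag == 'a' and self._depth:
            href = dict(attrs).get('href')
            if href:
                self._hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == 'h1' and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = ' '.join(''.join(self._text).split())
                self.headings.append((text, self._hrefs))

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)


def heading_title(html, page_url):
    """Return the first non-empty <h1> that does not link to another URL."""
    parser = HeadingTitleParser()
    parser.feed(html)
    parser.close()
    page = urldefrag(page_url).url
    for text, hrefs in parser.headings:
        links_elsewhere = any(
            urldefrag(urljoin(page_url, href)).url != page for href in hrefs
        )
        if text and not links_elsewhere:
            return text
    return ''
```

For a LASSO-like page, `heading_title('<h1><a href="https://www.epa.gov/">EPA</a></h1><h1>LASSO</h1>', 'https://www.epa.gov/lasso')` skips the home page link and returns the second heading.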
“National Flood Hazard Layer (NFHL)” has no heading elements at all (in the HTML; it does after scripts run, which is… not great). However, it does have: <span class="title">National Flood Hazard Layer (NFHL)</span>. So maybe looking for //*[contains(concat(' ',normalize-space(@class),' '),' title ')] is good?
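For reference, here is roughly what that class lookup does, written against the standard library's `html.parser` instead of an XPath engine. Splitting the class attribute on whitespace and checking for the token `title` should match the same elements as the `contains(concat(...))` expression. `ClassTitleParser` and `class_title` are hypothetical names, and returning the first match in document order is an assumption.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot hold title text.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class ClassTitleParser(HTMLParser):
    """Capture the text of the first element whose class list has 'title'."""

    def __init__(self):
        super().__init__()
        self.title = None
        self._tag = None   # tag name of the element being captured
        self._depth = 0    # nesting depth of same-named tags inside it
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self.title is not None:
            return
        if self._tag is None:
            classes = (dict(attrs).get('class') or '').split()
            if 'title' in classes and tag not in VOID_TAGS:
                self._tag, self._depth, self._text = tag, 1, []
        elif tag == self._tag:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._tag is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                self.title = ' '.join(''.join(self._text).split())
                self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def class_title(html):
    """Return the text of the first class="... title ..." element, or ''."""
    parser = ClassTitleParser()
    parser.feed(html)
    parser.close()
    return parser.title or ''
```

Matching on whole tokens means `class="subtitle"` is ignored while `class="page title"` is picked up.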
For plain text, maybe the first sentence of the first line? (example)
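One possible reading of "first sentence of the first line", as a sketch. It assumes the first non-empty line is what we want, treats `.`, `!`, or `?` followed by whitespace or end of line as the end of a sentence, and caps the result at a made-up `max_length`. `plain_text_title` is a hypothetical name.

```python
import re


def plain_text_title(text, max_length=255):
    """Return the first sentence of the first non-empty line, or ''."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            break
    else:
        # No non-empty line at all (also covers an empty string).
        return ''
    # A sentence ends at ., !, or ? followed by whitespace or end of line,
    # so "1.2" in the middle of a sentence does not end it.
    match = re.search(r'[.!?](?=\s|$)', line)
    sentence = line[:match.end()] if match else line
    return sentence[:max_length]
```

If the line has no sentence-ending punctuation, the whole line is used.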
A lot of error pages have no title. Maybe use <status code> <status text> (e.g. “404 Not Found”) in this case?
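Python's standard `http.HTTPStatus` enum already knows the status text for registered codes, so this fallback could be as small as the sketch below. `status_title` is a hypothetical name, and falling back to the bare number for unregistered codes is an assumption.

```python
from http import HTTPStatus


def status_title(status_code):
    """Build a title like '404 Not Found' from a numeric status code."""
    try:
        return f'{status_code} {HTTPStatus(status_code).phrase}'
    except ValueError:
        # Not a status code the standard library knows about.
        return str(status_code)
```

This would only kick in for error responses where no title could be found in the body.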
The current title lookup is in web-monitoring-processing/web_monitoring/utils.py, lines 169 to 180 in 9c6a2cf.
Argonne Nat’l Labs has similar issues to EPA’s LASSO, with the first <h1> being a link to the home page: https://monitoring.envirodatagov.org/page/d617a0c4-27b7-4bad-a190-983e25cc1819/0230138d-361b-40aa-a65b-d610f5fbe3e5..d431a180-aa91-44fc-aec9-c8a3f2706aa9