Description
When importing new versions of HTML pages (either from Wayback’s Memento API or from WARCs), we use the text of the page’s `<title>` element as the title, or the empty string if we can’t find one:
web-monitoring-processing/web_monitoring/utils.py
Lines 169 to 180 in 9c6a2cf

```python
def extract_title(content_bytes, encoding='utf-8'):
    "Return content of <title> tag as string. On failure return empty string."
    content_str = content_bytes.decode(encoding=encoding, errors='ignore')
    # The parser expects a file-like, so we mock one.
    content_as_file = io.StringIO(content_str)
    try:
        title = lxml.html.parse(content_as_file).find(".//title")
    except Exception:
        return ''
    if title is None or title.text is None:
        return ''
```

A bunch of pages turn out to have no `<title>` element, so it would probably be good to fall back to the `<h1>` or some other title-like information in the page body.

- Where present, the first `<h1>` seems like a reasonable fallback. Examples:
- EPA’s LASSO adds some complexity here. The title is in an `<h1>`, but another `<h1>` (the first one) is a link back to the EPA home page. Maybe look for an `<h1>` that doesn’t contain a link to a different URL?
- Argonne Nat’l Labs has similar issues with the first `<h1>` being a link to the home page: https://monitoring.envirodatagov.org/page/d617a0c4-27b7-4bad-a190-983e25cc1819/0230138d-361b-40aa-a65b-d610f5fbe3e5..d431a180-aa91-44fc-aec9-c8a3f2706aa9
- “National Flood Hazard Layer (NFHL)” has no heading elements at all (in the HTML; it does after scripts run, which is… not great). However, it does have: `<span class="title">National Flood Hazard Layer (NFHL)</span>`. So maybe looking for `//*[contains(concat(' ',normalize-space(@class),' '),' title ')]` is good?
- For plain text, maybe the first sentence of the first line? (example)
- A lot of error pages have no title. Maybe use `<status code> <status text>` (e.g. “404 Not Found”) in this case?
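
To make the ideas in the list above concrete, here is a rough sketch of how the fallbacks could be chained. It is only an illustration, not a proposed patch: `extract_title_with_fallbacks`, its `page_url` and `status_code` parameters, and the two private helpers are hypothetical names, and it assumes the caller knows the page's URL and HTTP status. The order (`<title>`, then the first `<h1>` that doesn't link elsewhere, then anything with a `title` class, then the status line) is one guess at a sensible priority. The plain-text case is not covered.

```python
import io
from http import HTTPStatus
from urllib.parse import urldefrag, urljoin

import lxml.html

# XPath from the list above: any element whose class list includes "title".
TITLE_CLASS_XPATH = (
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' title ')]"
)


def _text(element):
    """Return the element's text content with whitespace collapsed."""
    return ' '.join(element.text_content().split())


def _links_elsewhere(element, page_url):
    """True if the element contains a link to a URL other than page_url."""
    here = urldefrag(page_url).url
    for href in element.xpath('.//a/@href'):
        # Resolve relative links and ignore #fragments before comparing.
        if urldefrag(urljoin(page_url, href)).url != here:
            return True
    return False


def extract_title_with_fallbacks(content_bytes, page_url='', status_code=None,
                                 encoding='utf-8'):
    """Hypothetical helper: try <title>, then <h1>, then class="title",
    then the HTTP status line. Returns '' if nothing matches."""
    content_str = content_bytes.decode(encoding=encoding, errors='ignore')
    try:
        root = lxml.html.parse(io.StringIO(content_str)).getroot()
    except Exception:
        root = None

    if root is not None:
        # 1. The <title> element, as the existing function does.
        title = root.find('.//title')
        if title is not None and title.text and title.text.strip():
            return title.text.strip()

        # 2. The first <h1> that does not link to some other page.
        for heading in root.iter('h1'):
            text = _text(heading)
            if text and not _links_elsewhere(heading, page_url):
                return text

        # 3. Any element with a "title" class.
        for candidate in root.xpath(TITLE_CLASS_XPATH):
            text = _text(candidate)
            if text:
                return text

    # 4. Error pages: fall back to "<status code> <status text>".
    if status_code is not None and status_code >= 400:
        try:
            return f'{status_code} {HTTPStatus(status_code).phrase}'
        except ValueError:
            # Non-standard status code with no known phrase.
            return str(status_code)

    return ''
```

For a page shaped like the LASSO example (a first `<h1>` wrapping a home-page link, then the real heading), this skips the linked heading and returns the second one. Whether a heading that links elsewhere should be skipped outright or only deprioritized is an open question.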