Describe the bug
Indexing stops completely when a record contains a malformed URL.
Steps to reproduce the bug
For a record with the URL "http://eosims.asf.alaska.edu:12355.edu:80/", cdx-indexer fails with:
Traceback (most recent call last):
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 469, in main
    minimal=cmd.minimal_cdxj)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 301, in write_multi_cdx_index
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 339, in __call__
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 172, in create_record_iter
    entry['urlkey'] = canonicalize(entry['url'], surt_ordered)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/utils/canonicalize.py", line 48, in canonicalize
    raise UrlCanonicalizeException('Invalid Url: ' + url)
pywb.utils.canonicalize.UrlCanonicalizeException: Invalid Url: http://eosims.asf.alaska.edu:12355.edu:80/
The exception is not caught, so the whole indexing process stops.
Expected behavior
Wouldn't it be better to handle errors record by record? If one record has an invalid URL, the indexer could skip it and continue with the next record in the same WARC.
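To illustrate what I mean, here is a rough sketch of the skip-and-continue behavior. It does not use pywb's real code: `canonicalize` and `UrlCanonicalizeException` below are simplified stand-ins that only borrow the names from the traceback, and `iter_valid_entries` is a hypothetical helper. The stand-in rejects a URL when the standard library cannot parse its port, which may differ from the check pywb actually performs.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("indexer-sketch")


class UrlCanonicalizeException(Exception):
    """Stand-in for pywb.utils.canonicalize.UrlCanonicalizeException."""


def canonicalize(url):
    """Simplified stand-in for pywb's canonicalize (assumption, not the real logic).

    Rejects URLs whose port is not a valid integer and otherwise
    returns the lowercased URL as a fake "urlkey".
    """
    try:
        urlsplit(url).port  # raises ValueError for a non-numeric port
    except ValueError:
        raise UrlCanonicalizeException('Invalid Url: ' + url)
    return url.lower()


def iter_valid_entries(entries):
    """Yield entries with a urlkey, skipping those whose URL is invalid."""
    for entry in entries:
        try:
            entry['urlkey'] = canonicalize(entry['url'])
        except UrlCanonicalizeException as exc:
            # Log and move on instead of aborting the whole WARC.
            logger.warning("Skipping record: %s", exc)
            continue
        yield entry


entries = [
    {'url': 'http://example.com/'},
    {'url': 'http://eosims.asf.alaska.edu:12355.edu:80/'},
    {'url': 'http://example.org/a'},
]
indexed = list(iter_valid_entries(entries))
```

With this approach only the malformed record is dropped (with a warning), and the two valid records around it still end up in the index.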