Description
Hope you are well @john-kurkowski.
I'm seeing some strange behaviour where tldextract's "top_domain_under_public_suffix" property returns an empty string for hostnames such as:
- hostnames of the form [hostname].er
- hostnames of the form [hostname].kobe.jp
- hostnames of the form [hostname].sch.uk
extract("test.kobe.jp").top_domain_under_public_suffix
""
I think this is likely due to erroneous use of the * wildcard in the public suffix list, in rules such as:
*.er
*.sch.uk
*.kobe.jp
If this is the case and you agree, I'm considering taking the original public suffix list and cleaning it up slightly to remove the problematic rules. I'm using tldextract on billions of nodes in the Common Crawl corpus.
So, I'm considering:
- A regex rule such as (r"^\*\.(\w+\.jp)$", r"\1") to replace "*.kobe.jp" with "kobe.jp" in my_custom_list
- Reinitializing with tldextract.TLDExtract(suffix_list_urls=my_custom_list)
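To make the two steps concrete, here is a minimal sketch of what I have in mind, not a tested recipe. The names `strip_wildcard_rules`, `build_extractor` and `out_path` are hypothetical, and the regex is generalised from the ".jp" one above so that it also covers "*.er" and "*.sch.uk". It assumes suffix_list_urls is given a list of URLs, here a file:// URL pointing at the cleaned copy.

```python
import re
from pathlib import Path

# Matches a whole wildcard rule line such as "*.kobe.jp" and captures "kobe.jp".
WILDCARD_RULE = re.compile(r"^\*\.(\S+)$", re.MULTILINE)


def strip_wildcard_rules(psl_text: str) -> str:
    """Rewrite every "*.suffix" rule as "suffix"; leave all other lines alone."""
    return WILDCARD_RULE.sub(r"\1", psl_text)


def build_extractor(psl_text: str, out_path: Path):
    """Write the cleaned list to out_path and build an extractor that reads it."""
    import tldextract  # imported lazily so the cleaning step has no dependencies

    out_path.write_text(strip_wildcard_rules(psl_text), encoding="utf-8")
    return tldextract.TLDExtract(
        suffix_list_urls=[out_path.resolve().as_uri()],  # a file:// URL
        cache_dir=None,  # assumption: skip the on-disk cache of the parsed list
        fallback_to_snapshot=False,  # do not silently use the bundled list
    )
```

Comments ("// ...") and exception rules such as "!city.kobe.jp" are left untouched. One caveat I'm aware of: in the list's own format, "*.kobe.jp" means every direct child of kobe.jp is itself a public suffix, so this rewrite deliberately departs from the published rules.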
Does this sound right? Do you agree these are problems in https://publicsuffix.org/list/public_suffix_list.dat?