Overriding suffix_list_urls

Hope you are well @john-kurkowski.

I'm seeing some strange behaviour where tldextract returns an empty string from the `top_domain_under_public_suffix` property for hostnames such as:

- hostnames that end with ".er"
- hostnames of the form [hostname].kobe.jp
- hostnames of the form [hostname].sch.uk

`extract("test.kobe.jp").top_domain_under_public_suffix`
""

I think this is likely due to erroneous use of the `*` wildcard in the public suffix list, in rules such as:
*.er
*.sch.uk
*.kobe.jp
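
To make the suspected cause concrete, here is a minimal sketch of how I understand wildcard (`*.`) and exception (`!`) rules to be interpreted under the Public Suffix List format. It is a simplified, standalone illustration and not tldextract's actual implementation; the function names `public_suffix` and `registrable_domain` and the three-rule sample set are hypothetical.

```python
from __future__ import annotations


def public_suffix(domain: str, rules: set[str]) -> str:
    """Return the public suffix of `domain` under a simplified PSL matching."""
    labels = domain.lower().split(".")
    best = 1  # implicit default rule "*": the last label is the suffix
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        n = len(labels) - i
        # An exception rule prevails; its suffix drops the leftmost label.
        if "!" + candidate in rules:
            return ".".join(labels[i + 1:])
        # Exact rule match.
        if candidate in rules:
            best = max(best, n)
        # Wildcard rule: "*" stands in for the leftmost label of the candidate.
        if i + 1 < len(labels) and "*." + ".".join(labels[i + 1:]) in rules:
            best = max(best, n)
    return ".".join(labels[-best:])


def registrable_domain(domain: str, rules: set[str]) -> str:
    """Return the public suffix plus one label, or "" if the domain is itself a suffix."""
    labels = domain.lower().split(".")
    n = len(public_suffix(domain, rules).split("."))
    if len(labels) <= n:
        return ""
    return ".".join(labels[-(n + 1):])


# Hypothetical three-rule sample for illustration.
sample_rules = {"jp", "*.kobe.jp", "!city.kobe.jp"}
print(repr(registrable_domain("test.kobe.jp", sample_rules)))
print(repr(registrable_domain("www.test.kobe.jp", sample_rules)))
```

Under this reading, `*.kobe.jp` makes every direct child of kobe.jp a public suffix in its own right, so "test.kobe.jp" has no label left over to form a registrable domain, which would explain the empty string.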

If this is the case and you agree, I'm considering taking the original public suffix list and cleansing it slightly to remove the problematic rules. I'm using tldextract on billions of nodes in the Common Crawl corpus.

So I'm considering:
1) Applying regex rules such as `(r"^\*\.(\w+\.jp)$", r"\1")` to replace "*.kobe.jp" with "kobe.jp" in a my_custom_list
2) Reinitializing with `tldextract.TLDExtract(suffix_list_urls=my_custom_list)`
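
A rough sketch of the two steps above, under some assumptions on my part. As far as I can tell from the README, `suffix_list_urls` expects URLs (including `file://` URLs) rather than a list of rules, so the cleansed list is written to a local file first. The helper names `cleanse_suffix_list` and `build_extractor` are hypothetical, and only the first rewrite rule is the one from step 1; the other two are guesses for the ".er" and ".sch.uk" cases.

```python
import re
from pathlib import Path

# Rewrite rules as (pattern, replacement) pairs. The first is the regex from
# step 1; the other two are hypothetical additions for the remaining cases.
REWRITE_RULES = [
    (re.compile(r"^\*\.(\w+\.jp)$"), r"\1"),
    (re.compile(r"^\*\.(er)$"), r"\1"),
    (re.compile(r"^\*\.(sch\.uk)$"), r"\1"),
]


def cleanse_suffix_list(text: str) -> str:
    """Return the suffix list text with matching wildcard rules de-wildcarded."""
    cleaned = []
    for line in text.splitlines():
        rule = line.strip()
        for pattern, replacement in REWRITE_RULES:
            rule = pattern.sub(replacement, rule)
        cleaned.append(rule)
    return "\n".join(cleaned) + "\n"


def build_extractor(custom_list_path: Path):
    """Build an extractor that reads only the cleansed local list."""
    # Third-party import, done lazily so the cleansing step runs without it.
    import tldextract

    return tldextract.TLDExtract(
        # Assumption: a file:// URL pointing at the cleansed copy of the list.
        suffix_list_urls=[custom_list_path.resolve().as_uri()],
        # Do not silently fall back to the bundled snapshot if the file is missing.
        fallback_to_snapshot=False,
    )
```

Usage would be along the lines of writing `cleanse_suffix_list(original_text)` to a file such as `my_custom_list.dat` and passing that path to `build_extractor`. Note that exception rules such as `!city.kobe.jp` are left untouched by these patterns.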

Does this sound right? Do you agree that these are problems in the list at https://publicsuffix.org/list/public_suffix_list.dat ?

Overriding suffix_list_urls #361
