Overriding suffix_list_urls #361

@leeprevost

Hope you are well @john-kurkowski.

I'm seeing some strange behaviour where tldextract returns an empty string for the "top_domain_under_public_suffix" property on URLs such as:

URLs that end with ".er"

URLs of the form [hostname].kobe.jp

URLs of the form [hostname].sch.uk

>>> from tldextract import extract
>>> extract("test.kobe.jp").top_domain_under_public_suffix
''

I think this is likely due to erroneous use of the * wildcard in the public suffix list, in rules such as:
*.er
*.sch.uk
*.kobe.jp
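
To make the suspected cause concrete, here is a minimal sketch of how a wildcard rule is applied under the matching rules described at publicsuffix.org (a wildcard stands for exactly one label, and "!" exception rules take priority). This is a simplified illustration, not tldextract's own code; the function name registrable_domain and the small rule set are made up for the example.

```python
def registrable_domain(host: str, rules: set[str]) -> str:
    """Return the public suffix plus one label, or "" if nothing is left over."""
    labels = host.lower().split(".")
    suffix_len = 1  # implicit default rule "*": the last label is the suffix
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        n = len(labels) - i  # number of labels in this candidate
        if "!" + candidate in rules:
            # Exception rule: the suffix is the candidate minus its leftmost label.
            suffix_len = n - 1
            break
        if candidate in rules:
            suffix_len = max(suffix_len, n)
        rest = ".".join(labels[i + 1:])
        if rest and "*." + rest in rules:
            # Wildcard rule: any single label in front of `rest` is part of the suffix.
            suffix_len = max(suffix_len, n)
    if suffix_len >= len(labels):
        return ""  # the whole host is itself a public suffix
    return ".".join(labels[-(suffix_len + 1):])


# Hypothetical excerpt of the list, just enough to reproduce the cases above.
rules = {"jp", "*.kobe.jp", "!city.kobe.jp", "uk", "*.sch.uk", "*.er"}

print(repr(registrable_domain("test.kobe.jp", rules)))      # ''
print(repr(registrable_domain("www.test.kobe.jp", rules)))  # 'www.test.kobe.jp'
print(repr(registrable_domain("www.city.kobe.jp", rules)))  # 'city.kobe.jp'
```

Under this reading, "*.kobe.jp" makes "test.kobe.jp" itself a public suffix, so no label remains to form a domain underneath it, which would account for the empty string.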

If this is the case and you agree, I'm considering taking the original public suffix list and cleaning it slightly to remove these wildcard rules. I'm using tldextract to process billions of nodes in the Common Crawl corpus.

So, I'm considering:

  1. A regex substitution rule such as (r"^\*\.(\w+\.jp)$", r"\1") to replace "*.kobe.jp" with "kobe.jp" in a my_custom_list
  2. Reinitializing with tldextract.TLDExtract(suffix_list_urls=my_custom_list)
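
The two steps above could be sketched roughly as follows, using only the standard library for the cleaning part. The helper names strip_wildcards and write_custom_list are hypothetical, and the pattern is generalized from ".jp" to any wildcard rule, which is an assumption; narrow it if only some wildcards should be dropped.

```python
import re
import tempfile
from pathlib import Path

# Escaped form of the step 1 pattern, widened from ".jp" to any suffix (assumption).
WILDCARD_RULE = re.compile(r"^\*\.(\S+)\s*$")


def strip_wildcards(psl_text: str) -> str:
    """Rewrite wildcard rules such as "*.kobe.jp" as plain "kobe.jp" rules."""
    cleaned = []
    for line in psl_text.splitlines():
        # Comments, blank lines, exception rules, and plain rules do not match
        # the pattern, so they pass through unchanged.
        cleaned.append(WILDCARD_RULE.sub(r"\1", line))
    return "\n".join(cleaned) + "\n"


def write_custom_list(psl_text: str, path: Path) -> str:
    """Write the cleaned list to `path` and return it as a file:// URI."""
    path.write_text(strip_wildcards(psl_text), encoding="utf-8")
    return path.resolve().as_uri()


# Small made-up excerpt standing in for the downloaded public_suffix_list.dat.
sample = "// excerpt\n*.er\n*.sch.uk\n*.kobe.jp\n!city.kobe.jp\ncom\n"
custom_path = Path(tempfile.mkdtemp()) / "my_custom_list.dat"
custom_uri = write_custom_list(sample, custom_path)
print(custom_path.read_text(encoding="utf-8"))
```

Since suffix_list_urls expects a sequence of URLs (file:// URIs are accepted), the resulting URI would then be passed as tldextract.TLDExtract(suffix_list_urls=[custom_uri]), probably together with fallback_to_snapshot=False and a separate cache_dir so the stock list is not picked up from an earlier cache.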

Does this sound right? Do you agree these are problems in https://publicsuffix.org/list/public_suffix_list.dat?
