CachingHiveMetastore.getTableColumnStatistics not effective for some queries #21081

Open
@losipiuk

Description

After 52a17f1, cache entries in CachingHiveMetastore are keyed on the set of columns requested (previously, stats for all the columns were pulled from the metastore).
As a result, a query that consults HiveMetastore multiple times for different sets of columns of a single table may make more round trips to the metastore.
When communication with the metastore is costly, this causes a performance regression.

Edit: actually, the caching was already on a per-column basis before 52a17f1, since #16203. However, 52a17f1 changes the call pattern, so we sometimes observe more calls to CachingHiveMetastore. E.g. for the queries:

    CREATE TABLE test_self_join_table AS SELECT 2 AS age, 0 AS parent, 3 AS id;
    SELECT child.age, parent.age FROM test_self_join_table child JOIN test_self_join_table parent ON child.parent = parent.id;
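
To illustrate why the call pattern matters with per-column caching, here is a minimal sketch. It is not the actual CachingHiveMetastore code: `CountingMetastore`, `PerColumnCache`, and `countRoundTrips` are hypothetical names, and the request sequences are made up for illustration rather than taken from the real plan of the query above. It only shows that asking for column subsets one at a time costs more round trips than asking for all columns at once.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class PerColumnStatsCacheSketch {
    // Hypothetical stand-in for a remote metastore; it only counts round trips.
    static class CountingMetastore {
        int roundTrips = 0;

        Map<String, String> fetchColumnStats(String table, Set<String> columns) {
            roundTrips++;
            Map<String, String> stats = new HashMap<>();
            for (String column : columns) {
                stats.put(column, "stats(" + table + "." + column + ")");
            }
            return stats;
        }
    }

    // Hypothetical per-column cache: only columns not yet cached are fetched.
    static class PerColumnCache {
        private final CountingMetastore delegate;
        private final Map<String, String> cache = new HashMap<>();

        PerColumnCache(CountingMetastore delegate) {
            this.delegate = delegate;
        }

        Map<String, String> getColumnStats(String table, Set<String> columns) {
            Set<String> missing = new LinkedHashSet<>();
            for (String column : columns) {
                if (!cache.containsKey(table + "." + column)) {
                    missing.add(column);
                }
            }
            if (!missing.isEmpty()) {
                // One round trip per request that has at least one uncached column.
                delegate.fetchColumnStats(table, missing)
                        .forEach((column, stats) -> cache.put(table + "." + column, stats));
            }
            Map<String, String> result = new HashMap<>();
            for (String column : columns) {
                result.put(column, cache.get(table + "." + column));
            }
            return result;
        }
    }

    // Replays a sequence of column-set requests and returns the number of round trips.
    static int countRoundTrips(List<Set<String>> requests) {
        CountingMetastore metastore = new CountingMetastore();
        PerColumnCache cache = new PerColumnCache(metastore);
        for (Set<String> columns : requests) {
            cache.getColumnStats("test_self_join_table", columns);
        }
        return metastore.roundTrips;
    }

    public static void main(String[] args) {
        // Illustrative call patterns (assumed, not measured).
        int narrow = countRoundTrips(List.of(Set.of("age"), Set.of("parent"), Set.of("id")));
        int wide = countRoundTrips(List.of(Set.of("age", "parent", "id"), Set.of("age"), Set.of("id")));
        System.out.println("narrow requests: " + narrow + " round trips");
        System.out.println("wide first request: " + wide + " round trips");
    }
}
```

In this sketch, three narrow requests each miss the cache, while a single wide request warms it for the later ones. That is the kind of difference a changed call pattern can produce even though the cache itself is unchanged.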

cc: @dain @findepi
