Repartition association catalogs #587

@camposandro

The crossmatching products can be stored as association catalogs: https://github.com/astronomy-commons/lsdb/blob/ba4dbe6e017633d52b5369911a0da2cf8733e64b/src/lsdb/io/to_association.py#L49-L64

They will be partitioned according to the left catalog of the crossmatch.

Potential issue

LSDB does not repartition association catalogs before writing them to disk, so we can end up with many very small files that could be aggregated. Even if we did repartition, the implementation of joins via AssociationCatalog appears to be inherently bound to the partitioning of the left catalog:

https://github.com/astronomy-commons/lsdb/blob/ba4dbe6e017633d52b5369911a0da2cf8733e64b/src/lsdb/dask/join_catalog_data.py#L389-L398
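To make the "small files which could be aggregated" point concrete, here is a minimal sketch of what coarsening could look like at the pixel level. It is not LSDB code and does not use the LSDB API: `coarsen_partitions` and the row counts are hypothetical, and the sketch assumes partitions are identified by HEALPix `(order, pixel)` pairs in the NESTED numbering scheme, where the ancestor of a pixel is found by dropping two bits per order. It also assumes the input partitions do not overlap.

```python
from collections import defaultdict


def coarsen_partitions(partitions, target_order):
    """Group (order, pixel) partitions under their ancestor at target_order.

    partitions: dict mapping (order, pixel) -> row count (NESTED numbering).
    Partitions whose order is already <= target_order are kept unchanged.
    """
    merged = defaultdict(int)
    for (order, pixel), n_rows in partitions.items():
        if order > target_order:
            # Each HEALPix order adds 2 bits to a NESTED pixel index,
            # so shifting right by 2 bits per order yields the ancestor.
            pixel >>= 2 * (order - target_order)
            order = target_order
        merged[(order, pixel)] += n_rows
    return dict(merged)


# Hypothetical row counts for a handful of partitions.
example = {(7, 0): 10, (7, 1): 5, (7, 4): 7, (6, 3): 100, (5, 9): 1000}
coarse = coarsen_partitions(example, target_order=6)
```

In this example the two order-7 pixels 0 and 1 collapse into order-6 pixel 0, while order-7 pixel 4 moves to order-6 pixel 1. Whether such a coarsened layout could actually be used by the join depends on the left-catalog constraint described above.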

We should monitor the performance of `rubin.join(other, through=..., ...)` as the survey progresses and the data volume increases. If reading many small, high-order files for association catalogs adds too much overhead, we might need to revisit this and implement repartitioning.
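As a complement to timing the join itself, one rough way to track the small-file problem is to summarize the on-disk size of the association catalog's Parquet files over time. The sketch below uses only the standard library. The helper name `summarize_parquet_sizes`, the 1 MB threshold, and the assumption that data files end in `.parquet` are illustrative choices, not part of LSDB.

```python
from pathlib import Path


def summarize_parquet_sizes(root, small_bytes=1_000_000):
    """Count Parquet files under root and how many fall below small_bytes."""
    # Recursively collect the size in bytes of every *.parquet file.
    sizes = [p.stat().st_size for p in Path(root).rglob("*.parquet")]
    n_small = sum(1 for size in sizes if size < small_bytes)
    return {
        "n_files": len(sizes),
        "n_small": n_small,
        "total_bytes": sum(sizes),
    }
```

Calling `summarize_parquet_sizes("path/to/association_catalog")` on successive data releases (the path here is a placeholder) would show whether the fraction of small files is growing.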

Metadata

Labels

performance: For slow queries or compute bottlenecks

Projects

Status

Done
