Support persisting TableMetadata in the metastore #433
I think this is generally going in the right direction and I'm excited to see this kind of development. My one big ask is to split out the StringLoadingResponse from the actual caching here into another PR. At the moment it's a little bit large, although I think both changes are going to be good for the project. It would at least be easier for my brain :)
…laris into metadata-cache-p1
Sounds good @RussellSpitzer, thanks for taking a look. I will unwind the changes related to directly serving the metadata.json content from persistence for now.
Per @RussellSpitzer's review I've removed references to |
I was looking at updateTable in IcebergCatalogHandler.java. It also returns LoadTableResponses and calls load table in CatalogHandlerUtils. Is this also a potential call that could be optimized using this additional persistence?
This looks like a huge change (conceptually). Could you open a dev ML thread on this? I did not review in full, but I believe some aspects would be nice to discuss by email (GH comments are not always convenient).
"If nonzero, the approximate max size a table's metadata can be in order to be cached in the persistence"
    + " layer. If zero, no metadata will be cached or served from the cache. If -1, all metadata"
    + " will be cached.")
.defaultValue(Constants.METADATA_CACHE_MAX_BYTES_NO_CACHING)
Not a blocker: It's reasonable to turn it off by default. Do we have a recommended size so that users/admins could use it without additional evaluation work?
    != FeatureConfiguration.Constants.METADATA_CACHE_MAX_BYTES_INFINITE_CACHING) {
  if (metadataString.length() * 2 > maxBytesToCache) {
    LOGGER.debug(
        String.format(
I can see multiple occurrences of this pattern in the PR: LOGGER.debug(String.format(...)). That's going to be a problem given that, no matter whether debug logs are enabled, the log message will be computed and allocated on the heap. Could you switch to the placeholders pattern?
logger.debug("Will not cache metadata for {}; metadata above the limit of {} bytes", tableLikeEntity.getTableIdentifier(), maxBytesToCache);
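To illustrate the cost the reviewer describes, here is a minimal, self-contained sketch. `MiniLogger` is a hypothetical stand-in written for this example (it is not the SLF4J API); it only mimics the idea that `{}` placeholders are filled in after the level check, whereas `String.format` runs before the logger is even called. `CountingId` is likewise a made-up helper that counts how often it is stringified.

```java
/** Hypothetical stand-in for a placeholder-style logger; not the real SLF4J API. */
final class MiniLogger {
  private final boolean debugEnabled;
  final StringBuilder sink = new StringBuilder();

  MiniLogger(boolean debugEnabled) {
    this.debugEnabled = debugEnabled;
  }

  /** Eager variant: the caller has already built the full message. */
  void debug(String message) {
    if (debugEnabled) {
      sink.append(message).append('\n');
    }
  }

  /** Placeholder variant: arguments are stringified only when debug is enabled. */
  void debug(String template, Object... args) {
    if (!debugEnabled) {
      return;
    }
    StringBuilder out = new StringBuilder();
    int from = 0;
    for (Object arg : args) {
      int at = template.indexOf("{}", from);
      if (at < 0) {
        break;
      }
      out.append(template, from, at).append(arg);
      from = at + 2;
    }
    out.append(template.substring(from));
    sink.append(out).append('\n');
  }
}

/** Made-up table identifier that counts toString() invocations. */
final class CountingId {
  int toStringCalls = 0;

  @Override
  public String toString() {
    toStringCalls++;
    return "ns.table";
  }
}

final class LazyLoggingDemo {
  /** String.format runs before debug() is entered, even with debug disabled. */
  static int eagerCalls() {
    CountingId id = new CountingId();
    new MiniLogger(false).debug(String.format("Will not cache metadata for %s", id));
    return id.toStringCalls;
  }

  /** With placeholders, the disabled logger returns before touching the argument. */
  static int lazyCalls() {
    CountingId id = new CountingId();
    new MiniLogger(false).debug("Will not cache metadata for {}", id);
    return id.toStringCalls;
  }
}
```

With debug disabled, the eager call still stringifies its argument once, while the placeholder call never does.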
A Checkstyle rule we have on Apache Iceberg for this,
<module name="RegexpSinglelineJava">
  <property name="format" value="(?i)log(ger)?\.(debug|info|warn|error)\(.*%[sd]"/>
  <property name="message" value="SLF4J loggers support '{}' style formatting."/>
  <property name="ignoreComments" value="true"/>
</module>
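As a quick way to see which lines the expression in this rule would catch, the sketch below applies the same regular expression with `java.util.regex`. It assumes the check looks for a match anywhere within a single line (hence `find()`); the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class LogFormatRule {
  // Same expression as the `format` property in the Checkstyle module above.
  static final Pattern FORMAT_STYLE_LOG =
      Pattern.compile("(?i)log(ger)?\\.(debug|info|warn|error)\\(.*%[sd]");

  /** Returns true when the pattern is found anywhere in the given source line. */
  static boolean isFlagged(String line) {
    return FORMAT_STYLE_LOG.matcher(line).find();
  }
}
```

For example, `LOGGER.debug(String.format("Will not cache metadata for %s", id));` is flagged, while `logger.debug("Will not cache metadata for {}", id);` is not. The expression only looks for `%s` and `%d`, so other format specifiers such as `%.2f` would pass unnoticed.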
String metadataString = TableMetadataParser.toJson(tableMetadata);
if (maxBytesToCache
    != FeatureConfiguration.Constants.METADATA_CACHE_MAX_BYTES_INFINITE_CACHING) {
  if (metadataString.length() * 2 > maxBytesToCache) {
Shouldn't this place use the constant APPROXIMATE_BYTES_PER_CHAR instead of * 2?
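A minimal sketch of how the size check might read with the named constant in place of the literal. The constant names come from this PR, but their values are assumptions: 0 and -1 follow the flag description, and 2 bytes per char mirrors the `* 2` in the diff. The class name, the method name, and the use of `long` are hypothetical.

```java
final class MetadataCachePolicy {
  // Assumed from the flag description: zero disables caching, -1 caches everything.
  static final long METADATA_CACHE_MAX_BYTES_NO_CACHING = 0;
  static final long METADATA_CACHE_MAX_BYTES_INFINITE_CACHING = -1;
  // Assumed value, mirroring the literal `* 2` in the diff (a Java char is 16 bits).
  static final long APPROXIMATE_BYTES_PER_CHAR = 2;

  /** Decides whether serialized metadata is small enough to cache under the configured limit. */
  static boolean shouldCache(String metadataJson, long maxBytesToCache) {
    if (maxBytesToCache == METADATA_CACHE_MAX_BYTES_NO_CACHING) {
      return false;
    }
    if (maxBytesToCache == METADATA_CACHE_MAX_BYTES_INFINITE_CACHING) {
      return true;
    }
    // long arithmetic avoids int overflow for very large metadata strings
    long approximateBytes = metadataJson.length() * APPROXIMATE_BYTES_PER_CHAR;
    return approximateBytes <= maxBytesToCache;
  }
}
```

Keeping the estimate in one place means the limit check and any log message about the limit cannot drift apart if the per-char estimate changes.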
Description
This adds a new flag, METADATA_CACHE_MAX_BYTES, which allows the catalog to store table metadata in the metastore and vend it from there when loadTable is called. Entries are cached based on the metadata location. Currently, the entire metadata.json content is cached.
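The idea of caching by metadata location can be sketched roughly as follows. This is an illustration only, not the code in this PR: an in-memory map stands in for the metastore, a function stands in for reading metadata.json from storage, and all class, field, and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class MetadataCacheSketch {
  /** Hypothetical cached record: the location the JSON was read from, plus the JSON itself. */
  static final class Entry {
    final String metadataLocation;
    final String metadataJson;

    Entry(String metadataLocation, String metadataJson) {
      this.metadataLocation = metadataLocation;
      this.metadataJson = metadataJson;
    }
  }

  private final Map<String, Entry> store = new HashMap<>(); // stands in for the metastore
  private final Function<String, String> readFromStorage;   // stands in for reading metadata.json
  int storageReads = 0;

  MetadataCacheSketch(Function<String, String> readFromStorage) {
    this.readFromStorage = readFromStorage;
  }

  /** Serves cached JSON only if it was cached for the table's current metadata location. */
  String loadTable(String tableName, String currentMetadataLocation) {
    Entry cached = store.get(tableName);
    if (cached != null && cached.metadataLocation.equals(currentMetadataLocation)) {
      return cached.metadataJson;
    }
    // Miss or stale entry (the table has moved to a new metadata file): read and re-cache.
    storageReads++;
    String json = readFromStorage.apply(currentMetadataLocation);
    store.put(tableName, new Entry(currentMetadataLocation, json));
    return json;
  }
}
```

Because a commit produces a new metadata location, comparing locations is enough to detect a stale entry in this sketch; no separate invalidation step is modeled.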
Features not included in this PR:
There is partial support for (1) here and I want to extend it, but the goal is to structure things in a way that will allow us to implement (2) and (3) in the future as well.
Performance
I added a new benchmark here and collected results with and without caching enabled. For these results, I used a very narrow distribution that repeatedly hit the same few tables.
Without metadata persistence:
With metadata persistence:
How Has This Been Tested?
Existing tests vend table metadata correctly when caching is enabled.
Added a small test in BasePolarisCatalogTest to cover the basic semantics of caching.
Manual testing with EclipseLink: I observed the entities getting created in Postgres and saw large metadata being cached:
With MySQL, small metadata is persisted:
However, large metadata may cause internalproperties to exceed the size limit, in which case nothing will be cached. Calls still return safely.