Docs for Metadata Search #9300
📚 Documentation preview at https://pr-9300.docs-lakefs-preview.io/ (Updated: 7/18/2025, 11:50:36 AM - Commit: b7a5de8) |
Available in **lakeFS Enterprise** | ||
|
||
!!! tip | ||
lakeFS Metadata search is currently in private preview for [lakeFS Enterprise](../../enterprise/index.md) customers. |
And cloud? or does enterprise include cloud?
In theory it includes both; the link points to a features page that describes all Enterprise features together.
|
||
### Object Metadata Table Schema | ||
|
||
Each row in the lakeFS object metadata table represents the latest metadata for an object on the branch the table corresponds to. |
Not sure what latest means here, it's the information that exists on the reference.
It's explained two rows down, so maybe it's fine
changed it a bit to add clarity to what latest means
!!! info | ||
lakeFS object metadata tables are eventually consistent, which means it may take up to a few minutes for newly committed | ||
objects to become searchable. Metadata becomes searchable **atomically** — either all object metadata from the commit | ||
is available, or none of it is. |
Not sure if it's worth mentioning, but we preserve the order, that is: commit processed -> commit parent processed
thx, I added it
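The consistency note above implies that a client querying right after a commit may need to wait for ingestion. A minimal sketch of such a wait loop, assuming a caller-supplied `is_indexed` callable (hypothetical, not part of lakeFS) that runs something like the `WHERE commit_id = <head_commit> LIMIT 1` query shown in this PR and reports whether a row came back. Since metadata becomes searchable atomically per commit, a single row should be enough to treat the whole commit as available:

```python
import time
from typing import Callable


def wait_until_searchable(
    is_indexed: Callable[[str], bool],
    commit_id: str,
    timeout_s: float = 300.0,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until metadata for commit_id is searchable, or give up.

    is_indexed is a hypothetical callback that checks the object metadata
    table for at least one row with the given commit ID.
    Returns True if the commit became searchable before the timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if is_indexed(commit_id):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)
```

The default timeout of a few minutes mirrors the "up to a few minutes" wording in the callout; the exact values are placeholders.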
|
||
# Create a tag pointing to current HEAD | ||
tag = lakefs.Tag(repo.id, "v1.2").create(main_branch.id) | ||
tag_commit_id = tag.get_commit() |
I think it should be:
- tag_commit_id = tag.get_commit()
+ tag_commit_id = tag.get_commit().id
validated it and done
LGTM just small adjustments
|
||
With Metadata Search, you can query both: | ||
|
||
* **System metadata**: Automatically captured properties such as object path, size, last modified time, and committer. |
We also capture the object Content Type, though I'm not sure how we infer it
thx, these are just examples here
## How it Works | ||
|
||
lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog), | ||
nd uses catalog-level system tables to manage and expose versioned object metadata for querying. |
Suggested change:
- nd uses catalog-level system tables to manage and expose versioned object metadata for querying.
+ and uses catalog-level system tables to manage and expose versioned object metadata for querying.
done
```sql | ||
USE "<repo>-metadata.<branch>.system"; | ||
SELECT * FROM object_metadata | ||
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at | ||
LIMIT 1; | ||
``` |
The snippet isn't rendered correctly, should be:
```sql
USE "<repo>-metadata.<branch>.system";
SELECT * FROM object_metadata
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at
LIMIT 1;
```
Preview shows it is rendered correctly, what am I missing?
Overall looks great and I love the depth and examples. Things I believe should be considered:
- navigation (don't nest this doc too deeply!)
- consistency being a section and not a callout
- perhaps explaining how to use branches vs commits vs tags a little more clearly. I'm slightly confused atm so I imagine others might also be?
@@ -66,6 +66,16 @@ Where there is data, there is also metadata. lakeFS uses metadata to define sche | |||
## Merge | |||
lakeFS merge command, similar to the Git merge functionality, allows you to merge data branches. Once you commit data, you can review it and then merge the committed data into the target branch. A merge generates a commit on the target branch with all your changes. lakeFS guarantees atomic merges that are fast, given they don’t involve copying data. [Read More][merge]. | |||
|
|||
## Object Metadata |
👏
docs/mkdocs.yml
Outdated
@@ -143,6 +143,8 @@ nav: | |||
- Work with Data locally: howto/local-checkouts.md | |||
- Sizing Guide: howto/sizing-guide.md | |||
- Data Management: | |||
- Metadata: | |||
- Metadata search: datamanagment/metadata/metadata-search.md |
Why the nesting? I'd put Metadata Search directly below Data Management
It was done with other upcoming metadata capabilities in mind, but I'm totally fine with flattening it. Done.
!!! info | ||
Available in **lakeFS Enterprise** | ||
|
||
!!! tip |
Suggested change:
- !!! tip
+ !!! note
lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog), | ||
nd uses catalog-level system tables to manage and expose versioned object metadata for querying. | ||
|
||
For every searchable repository and branch (see [configuration](#configuring-metadata-search) for more info), lakeFS |
This is a bit confusing: you say that it's for every repository and branch (and also show this in the next line as the convention, i.e. ..metadata.<branch>.system..) but later you say that commits and tags are also supported.
agreed, I changed the section to talk more about how it works and removed the part that talks about immutable references from this section. I'm keeping it to the writing reproducible queries section.
(See [Writing reproducible queries](#writing-reproducible-queries) for more on querying by different reference types.) | ||
|
||
!!! info | ||
You can use Metadata Search even if you’re not licensed for full lakeFS Iceberg support. |
Say I buy metadata search but not Iceberg REST - will I have a REST catalog that only allows reading metadata and not creating other tables in it?
It will be backed by a licensing restriction in the future. For now it doesn't work like that.
### Writing Reproducible Queries | ||
|
||
When you search metadata on a branch, the results reflect the state of the branch’s HEAD commit at query time, provided | ||
the metadata has been ingested (eventual consistency applies). However, since a branch’s HEAD is mutable, it moves forward |
if consistency is a section, you can link to it ;)
done
|
||
To search metadata associated with a specific tag: | ||
1. Retrieve the commit ID the tag points to. | ||
2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids-). |
Suggested change:
- 2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids-).
+ 2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids).
Are you sure? IntelliJ forces me to use - as a prefix.
|
||
```sql | ||
USE "repo-metadata.commit-dc3117ec3a727104226c896bf7ab9350ee5da06ae052406262840e9a4a8c9ffb.system"; | ||
SHOW TABLES; |
why is this needed?
it isn't, removed
## Limitations | ||
|
||
* Applies only to objects added or modified after the feature was enabled. Existing objects before that point are not indexed. | ||
* No direct commit and tag support: To query by commit or tag, see the [writing reproducible queries](#writing-reproducible-queries) |
What do you mean by "direct commit"? earlier you showed an example of querying by commit ID, no?
To query by commit ID you need to prefix it with a commit- prefix. I changed this and will try to add clarity to the reproducible queries section.
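To make the branch vs. commit vs. tag distinction concrete, here is a small sketch of how the table identifiers could be built. The branch form and the `commit-` namespace prefix come from the examples in this PR; appending `.object_metadata` to the commit namespace, and the helper names themselves, are assumptions. `tag` is any object exposing `get_commit().id`, as `lakefs.Tag` does in the snippet validated earlier in this review.

```python
def branch_table(repo: str, branch: str) -> str:
    """Table identifier for a branch; results follow the branch HEAD (mutable)."""
    return f"{repo}-metadata.{branch}.system.object_metadata"


def commit_table(repo: str, commit_id: str) -> str:
    """Table identifier for a specific commit; note the 'commit-' prefix.

    Assumes the commit namespace holds the same object_metadata table
    as the branch namespace.
    """
    return f"{repo}-metadata.commit-{commit_id}.system.object_metadata"


def tag_table(repo: str, tag) -> str:
    """Table identifier for a tag, resolved through the commit it points to.

    tag is expected to expose get_commit().id (e.g. a lakefs.Tag).
    """
    return commit_table(repo, tag.get_commit().id)
```

Only the branch form moves with new commits; the commit and tag forms stay pinned, which is what makes a query reproducible.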
Great illustration
Looks good. I think the part I'm missing is permissions/auth: explain to the user which lakeFS permissions are needed and on which repository. Also mention that the client performing the query needs access to the underlying storage under the lakeFS repository's storage namespace.
Would it be possible to add this one as SVG or JPEG? As a PNG it will be very large.
done, changed to svg
* **User-defined metadata**: Custom labels, annotations, or tags stored as lakeFS object metadata — typically added during | ||
ingestion, processing, or curation. | ||
|
||
 |
Should the image move to the How it Works section later?
I had the same dilemma, but my intent here is to use a higher-level diagram that describes the feature and, for simplicity, is not 100% accurate. For example, the queried repo name is incorrect, and so are the table schema and records. WDYT?
## Benefits | ||
|
||
* **Scalable**: Search metadata across millions or billions of objects. | ||
* **Query Reproducibility**: Run metadata queries against specific commits or tags for consistent results. | ||
* **No infrastructure burden**: lakeFS manages metadata collection and indexing natively: no need to build, deploy or | ||
maintain a separate metadata tracking system. |
Are these lakeFS-specific implementation benefits? When I first read it I thought these were 'metadata search' benefits, but a lot of those are listed as use cases, like data discoverability.
The distinction between benefits and use cases is that benefits describe the value the feature provides across scenarios. For metadata search, this includes scalability, reproducibility, and automation. Use cases explain how the feature is applied in practice, along with the impact on each workflow. The same benefits apply across all listed use cases. Makes sense?
|
||
!!! info | ||
You can use Metadata Search even if you’re not licensed for full lakeFS Iceberg support. | ||
If you're already using another Iceberg REST catalog, you don’t need to switch — metadata search will still work using |
I don't understand this one. Can Metadata Search use another Iceberg REST catalog?
What I meant to say (and it's probably not clear enough) is that if you are already using Iceberg, you don't need to use lakeFS with your self-created Iceberg tables.
From the first part, "If you're already using another Iceberg REST catalog, you don’t need to switch — ...", it sounds like we are telling the user they don't have to switch and that we can use their catalog.
Maybe it is the word "switch" that bothers me.
|
||
## How it Works | ||
|
||
lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog), |
use the lakeFS Iceberg REST catalog
done
TODO | ||
* metadata server configurations | ||
* lakeFS server | ||
* Searchable repos and branches | ||
* Iceberg catalog configurations? |
Don't forget to remove/update
|
||
To search by object metadata in lakeFS, you query the Iceberg object metadata tables lakeFS creates and manages. | ||
|
||
Object metadata tables are standard Iceberg tables, meaning, you can query it using any Iceberg-compatible engine, including |
Object metadata is stored in Iceberg table
done
To search by object metadata in lakeFS, you query the Iceberg object metadata tables lakeFS creates and manages. | ||
|
||
Object metadata tables are standard Iceberg tables, meaning, you can query it using any Iceberg-compatible engine, including | ||
DuckDB, Trino, Spark, PyIceberg, or else. |
, or others.
done
2. Load the object metadata table that represents the reference you would like to query. | ||
3. Use SQL to search by system or user-defined metadata. | ||
|
||
Here’s an example using PyIceberg and DuckDB: |
Note that the code doesn't import duckdb; I assume the duckdb package must be installed for this integration. Consider adding a note about the packages required to run the code, or provide one Python example that uses PyIceberg and one that uses DuckDB directly.
'credential': f'AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY', | ||
}) | ||
|
||
con = catalog.load_table('repo-metadata.main.system.object_metadata').scan().to_duckdb('object_metadata') |
Add a comment that 'repo' stands for the name of the repository we'd like to search.
done
Looks good! The only thing I'm worried about is the structure of the config file - approving because this is orthogonal to merging the docs :)
max_commits: 100 | ||
repositories: | ||
"example-repo-1": | ||
- "main" |
This is not docs related - it's a design comment which I feel strongly about:
The hierarchy here is too implicit and will not support tags and arbitrary commits.
I suggest:
    ...
    repositories:
      example-repo-1:
        branches:
          - main
          - dev
This would later allow adding tags or commits while being backwards compatible.
Thanks, will follow up on this one asap.
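A rough sketch of how the proposed shape could stay backwards compatible: a loader could accept both the current list form and the nested `branches` form. `searchable_branches` is a hypothetical helper operating on an already-parsed mapping; it is not the actual lakeFS config loader.

```python
from typing import Any, Dict, List


def searchable_branches(repositories: Dict[str, Any]) -> Dict[str, List[str]]:
    """Normalize the 'repositories' config section to repo -> list of branches.

    Accepts both shapes discussed in this thread:
      current:  {"example-repo-1": ["main"]}
      proposed: {"example-repo-1": {"branches": ["main", "dev"]}}
    """
    result: Dict[str, List[str]] = {}
    for repo, spec in repositories.items():
        if isinstance(spec, list):
            # Current shape: the value is the branch list itself.
            result[repo] = list(spec)
        elif isinstance(spec, dict):
            # Proposed shape: branches live under an explicit key, which
            # leaves room for sibling keys such as tags or commits later.
            result[repo] = list(spec.get("branches", []))
        else:
            raise TypeError(f"unsupported config for repository {repo!r}")
    return result
```

With the explicit `branches` key, adding `tags` or `commits` later becomes a new sibling key rather than a change to the meaning of the list.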
Here’s an example using PyIceberg and DuckDB: | ||
|
||
!!! requirements | ||
This requires duckdb to be installed. |
maybe refer to their docs/installation guide
done
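One way to back up the requirements note is a small preflight check before running the example. This is only a sketch: `missing_packages` is a hypothetical helper, and `pyiceberg` and `duckdb` are the import names I'd assume for the two packages.

```python
import importlib.util
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example usage (assumed import names):
#   missing = missing_packages(["pyiceberg", "duckdb"])
#   if missing:
#       raise SystemExit(f"Please install: {', '.join(missing)}")
```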
|
||
Querying metadata tables using a branch name, e.g., `repo-metadata.main.system.object_metadata` return results based on | ||
the state of the branch’s HEAD commit at the time of the query, assuming the metadata has already been ingested (within | ||
[eventual consistency](#consistency-) constraints). However, because branch heads are mutable and advance with each new |
Suggested change:
- [eventual consistency](#consistency-) constraints). However, because branch heads are mutable and advance with each new
+ [eventual consistency](#consistency) constraints). However, because branch heads are mutable and advance with each new
done
Part of https://github.com/treeverse/lakeFS-Enterprise/issues/455
This PR includes the docs for Metadata Search, except the Configuration section, which will be added tomorrow.
Please review although it is marked as a draft.