Docs for Metadata Search #9300

Merged
merged 31 commits into from
Jul 18, 2025
Conversation

@talSofer talSofer commented Jul 15, 2025

Part of https://github.com/treeverse/lakeFS-Enterprise/issues/455

This PR includes the docs for Metadata Search, except the Configuration section, which will be added tomorrow.

Please review although it is marked as a draft.

github-actions bot commented Jul 15, 2025

📚 Documentation preview at https://pr-9300.docs-lakefs-preview.io/

(Updated: 7/18/2025, 11:50:36 AM - Commit: b7a5de8)

@talSofer talSofer added the exclude-changelog and docs labels Jul 15, 2025
Available in **lakeFS Enterprise**

!!! tip
lakeFS Metadata search is currently in private preview for [lakeFS Enterprise](../../enterprise/index.md) customers.
Contributor

And cloud? or does enterprise include cloud?

Contributor Author

In theory it includes both; the link points to a features page that describes all Enterprise features together.


### Object Metadata Table Schema

Each row in the lakeFS object metadata table represents the latest metadata for an object on the branch the table corresponds to.
Contributor

Not sure what latest means here, it's the information that exists on the reference.

Contributor

It's explained two rows down, so maybe it's fine

Contributor Author

changed it a bit to add clarity to what latest means

Comment on lines 89 to 92
!!! info
lakeFS object metadata tables are eventually consistent, which means it may take up to a few minutes for newly committed
objects to become searchable. Metadata becomes searchable **atomically** — either all object metadata from the commit
is available, or none of it is.
Contributor

Not sure if it's worth mentioning but we preserve the order that is: commit processed -> commit parent processed

Contributor Author

thx, I added it


# Create a tag pointing to current HEAD
tag = lakefs.Tag(repo.id, "v1.2").create(main_branch.id)
tag_commit_id = tag.get_commit()
Contributor

I think

Suggested change
tag_commit_id = tag.get_commit()
tag_commit_id = tag.get_commit().id

Contributor Author

validated it and done

Contributor

@AliRamberg AliRamberg left a comment

LGTM just small adjustments


With Metadata Search, you can query both:

* **System metadata**: Automatically captured properties such as object path, size, last modified time, and committer.
Contributor

We also capture the object Content Type, though I'm not sure how we infer it

Contributor Author

thx, these are just examples here

## How it Works

lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog),
nd uses catalog-level system tables to manage and expose versioned object metadata for querying.
Contributor

Suggested change
nd uses catalog-level system tables to manage and expose versioned object metadata for querying.
and uses catalog-level system tables to manage and expose versioned object metadata for querying.

Contributor Author

done

Comment on lines 97 to 102
```sql
USE "<repo>-metadata.<branch>.system";
SELECT * FROM object_metadata
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at
LIMIT 1;
```
Contributor

The snippet isn't rendered correctly, should be:

Suggested change
```sql
USE "<repo>-metadata.<branch>.system";
SELECT * FROM object_metadata
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at
LIMIT 1;
```
```sql
USE "<repo>-metadata.<branch>.system";
SELECT * FROM object_metadata
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at
LIMIT 1;
```

Contributor Author

Preview shows it is rendered correctly, what am I missing?

Collaborator

@ozkatz ozkatz left a comment

Overall looks great and I love the depth and examples. Things I believe should be considered:

  1. navigation (don't nest this doc too deeply!)
  2. consistency being a section and not a callout
  3. perhaps explaining how to use branches vs commits vs tags a little more clearly. I'm slightly confused atm so I imagine others might also be?

@@ -66,6 +66,16 @@ Where there is data, there is also metadata. lakeFS uses metadata to define sche
## Merge
lakeFS merge command, similar to the Git merge functionality, allows you to merge data branches. Once you commit data, you can review it and then merge the committed data into the target branch. A merge generates a commit on the target branch with all your changes. lakeFS guarantees atomic merges that are fast, given they don’t involve copying data. [Read More][merge].

## Object Metadata
Collaborator

👏

docs/mkdocs.yml Outdated
@@ -143,6 +143,8 @@ nav:
- Work with Data locally: howto/local-checkouts.md
- Sizing Guide: howto/sizing-guide.md
- Data Management:
  - Metadata:
    - Metadata search: datamanagment/metadata/metadata-search.md
Collaborator

Why the nesting? I'd put Metadata Search directly below Data Management

Contributor Author

The nesting was meant to leave room for other upcoming metadata capabilities, but I'm totally fine with flattening it. Done.

!!! info
Available in **lakeFS Enterprise**

!!! tip
Collaborator

Suggested change
!!! tip
!!! note

lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog),
nd uses catalog-level system tables to manage and expose versioned object metadata for querying.

For every searchable repository and branch (see [configuration](#configuring-metadata-search) for more info), lakeFS
Collaborator

This is a bit confusing: you say that it's for every repository and branch (and also show this in the next line as the convention, i.e. `..metadata.<branch>.system..`), but later you say that commits and tags are also supported.

Contributor Author

Agreed. I changed the section to talk more about how it works and removed the part about immutable references from it. I'm keeping that part for the Writing Reproducible Queries section.

(See [Writing reproducible queries](#writing-reproducible-queries) for more on querying by different reference types.)

!!! info
You can use Metadata Search even if you’re not licensed for full lakeFS Iceberg support.
Collaborator

Say I buy metadata search but not Iceberg REST - will I have a REST catalog that only allows reading metadata and not creating other tables in it?

Contributor Author

It will be backed by a licensing restriction in the future. For now it doesn't work that way.

### Writing Reproducible Queries

When you search metadata on a branch, the results reflect the state of the branch’s HEAD commit at query time, provided
the metadata has been ingested (eventual consistency applies). However, since a branch’s HEAD is mutable, it moves forward
Collaborator

if consistency is a section, you can link to it ;)

Contributor Author

done


To search metadata associated with a specific tag:
1. Retrieve the commit ID the tag points to.
2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids-).
Collaborator

Suggested change
2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids-).
2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids).

Contributor Author

Are you sure? IntelliJ forces me to use - as a prefix.


```sql
USE "repo-metadata.commit-dc3117ec3a727104226c896bf7ab9350ee5da06ae052406262840e9a4a8c9ffb.system";
SHOW TABLES;
```
Collaborator

why is this needed?

Contributor Author

it isn't, removed

## Limitations

* Applies only to objects added or modified after the feature was enabled. Existing objects before that point are not indexed.
* No direct commit and tag support: To query by commit or tag, see the [writing reproducible queries](#writing-reproducible-queries)
Collaborator

What do you mean by "direct commit"? earlier you showed an example of querying by commit ID, no?

Contributor Author

To query by commit ID, you need to prefix the ID with `commit-`. I changed this and will try to add clarity to the reproducible queries section.
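
To make the naming convention discussed in this thread concrete, here is a minimal sketch of how the table identifiers could be built for branches, commits, and tags. The helper names `metadata_table` and `metadata_table_for_tag` are hypothetical and not part of lakeFS; the identifier shapes are taken from the snippets quoted in this PR (`repo-metadata.main.system.object_metadata` and the `commit-<id>` namespace). The tag resolver is passed in as a function so the sketch runs without a server; with the lakeFS Python SDK it would presumably wrap the `tag.get_commit().id` call validated earlier in this conversation.

```python
from typing import Callable


def metadata_table(repo: str, ref: str, *, is_commit: bool = False) -> str:
    """Build an object metadata table identifier (hypothetical helper).

    Branch names are used as-is; commit IDs get the "commit-" prefix
    described in this thread.
    """
    namespace_ref = f"commit-{ref}" if is_commit else ref
    return f"{repo}-metadata.{namespace_ref}.system.object_metadata"


def metadata_table_for_tag(
    repo: str, tag: str, resolve_commit_id: Callable[[str], str]
) -> str:
    """Resolve a tag to its commit ID, then use the commit-based pattern."""
    commit_id = resolve_commit_id(tag)
    return metadata_table(repo, commit_id, is_commit=True)


# Usage with a stand-in resolver (a real one would query lakeFS for the tag).
print(metadata_table("repo", "main"))
# repo-metadata.main.system.object_metadata
print(metadata_table_for_tag("repo", "v1.2", lambda tag: "dc3117ec"))
# repo-metadata.commit-dc3117ec.system.object_metadata
```

Pinning a query to a commit-based identifier is what makes it reproducible: unlike a branch name, the commit ID never moves.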

Collaborator

Great illustration

Contributor

@nopcoder nopcoder left a comment

Looks good. I think the part I'm missing is permissions/auth: explain to the user which lakeFS permissions are required, and on which repository. Also mention that the client performing the query needs access to the underlying storage under the lakeFS repository's storage namespace.

Contributor

Would it be possible to add this one as SVG or JPEG? As a PNG it will be very large.

Contributor Author

done, changed to svg

* **User-defined metadata**: Custom labels, annotations, or tags stored as lakeFS object metadata — typically added during
ingestion, processing, or curation.

![metadata search](../../assets/img/mds/mds_how_it_works.png)
Contributor

Should we move the image to the How it Works section later?

Contributor Author

I had the same dilemma, but my intent here is to use a higher-level diagram that describes the feature and, for simplicity, is not 100% accurate. For example, the queried repo name is incorrect, and so are the table schema and records. WDYT?

Comment on lines 34 to 39
## Benefits

* **Scalable**: Search metadata across millions or billions of objects.
* **Query Reproducibility**: Run metadata queries against specific commits or tags for consistent results.
* **No infrastructure burden**: lakeFS manages metadata collection and indexing natively: no need to build, deploy or
maintain a separate metadata tracking system.
Contributor

These are lakeFS specific implementation benefits? when I read it first I thought these are 'metadata search' benefits, but a lot of them are listed as use-cases, like data discoverability.

Contributor Author

The distinction between benefits and use cases is that benefits describe the value the feature provides across scenarios. For metadata search, this includes scalability, reproducibility, and automation. Use cases explain how the feature is applied in practice, along with the impact on each workflow. The same benefits apply across all listed use cases. Makes sense?


!!! info
You can use Metadata Search even if you’re not licensed for full lakeFS Iceberg support.
If you're already using another Iceberg REST catalog, you don’t need to switch — metadata search will still work using
Contributor

Don't understand this one. Metadata search can use other Iceberg REST catalog?

Contributor Author

What I meant to say (and it is probably not clear enough) is that if you are already using Iceberg, you don't need to use lakeFS with your self-created Iceberg tables.

Contributor

From the first part, "If you're already using another Iceberg REST catalog, you don’t need to switch — ...", it sounds like we are telling the user they don't have to switch and that we can use their catalog.
Maybe it is the word "switch" that bothers me.


## How it Works

lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog),
Contributor

use the lakeFS Iceberg REST catalog

Contributor Author

done

Comment on lines 110 to 114
TODO
* metadata server configurations
* lakeFS server
* Searchable repos and branches
* Iceberg catalog configurations?
Contributor

Don't forget to remove/update


To search by object metadata in lakeFS, you query the Iceberg object metadata tables lakeFS creates and manages.

Object metadata tables are standard Iceberg tables, meaning, you can query it using any Iceberg-compatible engine, including
Contributor

Object metadata is stored in Iceberg table

Contributor Author

done

To search by object metadata in lakeFS, you query the Iceberg object metadata tables lakeFS creates and manages.

Object metadata tables are standard Iceberg tables, meaning, you can query it using any Iceberg-compatible engine, including
DuckDB, Trino, Spark, PyIceberg, or else.
Contributor

, or others.

Contributor Author

done

2. Load the object metadata table that represents the reference you would like to query.
3. Use SQL to search by system or user-defined metadata.

Here’s an example using PyIceberg and DuckDB:
Contributor

Note that the code doesn't import duckdb; I assume this integration requires the duckdb package to be installed. Consider adding a note about the packages needed for the code to run, or create one Python example that uses PyIceberg and one that uses DuckDB directly.

'credential': f'AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY',
})

con = catalog.load_table('repo-metadata.main.system.object_metadata').scan().to_duckdb('object_metadata')
Contributor

Add a comment that 'repo' stands for the name of the repository we'd like to search.

Contributor Author

done

@talSofer talSofer marked this pull request as ready for review July 17, 2025 09:47
@talSofer talSofer requested review from nopcoder and ozkatz July 17, 2025 13:00
@talSofer talSofer added the minor-change label Jul 17, 2025
Collaborator

@ozkatz ozkatz left a comment

Looks good! The only thing I'm worried about is the structure of the config file - approving because this is orthogonal to merging the docs :)

max_commits: 100
repositories:
  "example-repo-1":
    - "main"
Collaborator

This is not docs related - it's a design comment which I feel strongly about:
The hierarchy here is too implicit and will not support tags and arbitrary commits.
I suggest:

...
repositories:
  example-repo-1:
    branches:
      - main
      - dev

This would later allow adding tags or commits while being backwards compatible.

Contributor Author

Thanks, will follow up on this one asap.
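
To illustrate the backwards-compatibility point in the suggestion above, here is a minimal sketch of a loader that accepts both shapes: the current implicit form (repository mapped to a list of branch names) and the proposed explicit form (repository mapped to a `branches` key). The function name `normalize_repositories` is hypothetical, and this is not how lakeFS actually parses its configuration; it only shows why the explicit form leaves room for future keys.

```python
from typing import Dict, List, Union

# A repository entry is either a bare list of branches (implicit form)
# or a mapping from reference kind to names (explicit form).
RepoEntry = Union[List[str], Dict[str, List[str]]]


def normalize_repositories(
    repositories: Dict[str, RepoEntry],
) -> Dict[str, Dict[str, List[str]]]:
    """Convert both config shapes into the explicit one (hypothetical helper)."""
    normalized: Dict[str, Dict[str, List[str]]] = {}
    for repo, entry in repositories.items():
        if isinstance(entry, list):
            # Implicit form: a bare list is interpreted as branch names.
            normalized[repo] = {"branches": list(entry)}
        elif isinstance(entry, dict):
            # Explicit form: keep every key, so new kinds can be added later.
            normalized[repo] = {kind: list(refs) for kind, refs in entry.items()}
        else:
            raise TypeError(f"unsupported entry for {repo!r}: {type(entry).__name__}")
    return normalized


old_style = {"example-repo-1": ["main"]}
new_style = {"example-repo-1": {"branches": ["main", "dev"]}}
print(normalize_repositories(old_style))
# {'example-repo-1': {'branches': ['main']}}
print(normalize_repositories(new_style))
# {'example-repo-1': {'branches': ['main', 'dev']}}
```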

Here’s an example using PyIceberg and DuckDB:

!!! requirements
This requires duckdb to be installed.
Collaborator

maybe refer to their docs/installation guide

Contributor Author

done


Querying metadata tables using a branch name, e.g., `repo-metadata.main.system.object_metadata`, returns results based on
the state of the branch’s HEAD commit at the time of the query, assuming the metadata has already been ingested (within
[eventual consistency](#consistency-) constraints). However, because branch heads are mutable and advance with each new
Collaborator

Suggested change
[eventual consistency](#consistency-) constraints). However, because branch heads are mutable and advance with each new
[eventual consistency](#consistency) constraints). However, because branch heads are mutable and advance with each new

Contributor Author

done

@talSofer talSofer merged commit 2cb42bc into master Jul 18, 2025
42 checks passed
@talSofer talSofer deleted the docs/metadata-search branch July 18, 2025 11:50
5 participants