Docs for Metadata Search #9300

talSofer · 2025-07-15T09:05:44Z

Part of https://github.com/treeverse/lakeFS-Enterprise/issues/455

This PR includes the docs for Metadata Search, except for the Configuration section, which will be added tomorrow.

Please review although it is marked as a draft.

github-actions · 2025-07-15T09:06:19Z

📚 Documentation preview at https://pr-9300.docs-lakefs-preview.io/

(Updated: 7/18/2025, 11:50:36 AM - Commit: b7a5de8)

guy-har · 2025-07-16T09:47:05Z

docs/src/datamanagment/metadata/metadata-search.md

+    Available in **lakeFS Enterprise**
+
+!!! tip
+    lakeFS Metadata search is currently in private preview for [lakeFS Enterprise](../../enterprise/index.md) customers.


And cloud? or does enterprise include cloud?

In theory it includes both. The link goes to a features page that describes all enterprise features together.

guy-har · 2025-07-16T09:53:54Z

docs/src/datamanagment/metadata/metadata-search.md

+
+### Object Metadata Table Schema
+
+Each row in the lakeFS object metadata table represents the latest metadata for an object on the branch the table corresponds to.


Not sure what latest means here, it's the information that exists on the reference.

It's explained two rows down, so maybe it's fine

changed it a bit to add clarity to what latest means

guy-har · 2025-07-16T09:59:37Z

docs/src/datamanagment/metadata/metadata-search.md

+!!! info
+    lakeFS object metadata tables are eventually consistent, which means it may take up to a few minutes for newly committed 
+    objects to become searchable. Metadata becomes searchable **atomically** — either all object metadata from the commit 
+    is available, or none of it is. 


Not sure if it's worth mentioning but we preserve the order that is: commit processed -> commit parent processed

thx, I added it

guy-har · 2025-07-16T10:06:30Z

docs/src/datamanagment/metadata/metadata-search.md

+
+# Create a tag pointing to current HEAD
+tag = lakefs.Tag(repo.id, "v1.2").create(main_branch.id)
+tag_commit_id = tag.get_commit()


I think

Suggested change

tag_commit_id = tag.get_commit()

tag_commit_id = tag.get_commit().id

validated it and done
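
The difference between the two lines can be sketched as follows. This is a minimal illustration, assuming (as the validated suggestion implies) that `get_commit()` returns a commit object whose `id` attribute holds the commit ID. `FakeTag` and `FakeCommit` are hypothetical stand-ins, not part of the lakeFS SDK, so that the snippet runs without a lakeFS server.

```python
from dataclasses import dataclass


@dataclass
class FakeCommit:
    """Hypothetical stand-in for the commit object returned by get_commit()."""
    id: str


@dataclass
class FakeTag:
    """Hypothetical stand-in for a tag; only get_commit() is modeled."""
    commit: FakeCommit

    def get_commit(self) -> FakeCommit:
        return self.commit


def tag_commit_id(tag) -> str:
    # get_commit() yields a commit object, so the ID is read from its .id attribute.
    return tag.get_commit().id


tag = FakeTag(FakeCommit("dc3117ec"))
print(tag_commit_id(tag))
```

Without the trailing `.id`, the variable would hold the whole commit object rather than the string needed to build a commit-based table name.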

AliRamberg

LGTM just small adjustments

AliRamberg · 2025-07-16T11:40:43Z

docs/src/datamanagment/metadata/metadata-search.md

+
+With Metadata Search, you can query both:
+
+* **System metadata**: Automatically captured properties such as object path, size, last modified time, and committer.


We also capture the object Content Type, though I'm not sure how we infer it

thx, these are just examples here

AliRamberg · 2025-07-16T11:58:32Z

docs/src/datamanagment/metadata/metadata-search.md

+## How it Works
+
+lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog),
+nd uses catalog-level system tables to manage and expose versioned object metadata for querying.


Suggested change

nd uses catalog-level system tables to manage and expose versioned object metadata for querying.

and uses catalog-level system tables to manage and expose versioned object metadata for querying.

AliRamberg · 2025-07-16T12:19:01Z

docs/src/datamanagment/metadata/metadata-search.md

+    ```sql
+    USE "<repo>-metadata.<branch>.system";
+    SELECT * FROM object_metadata
+    WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at 
+    LIMIT 1;
+    ```


The snippet isn't rendered correctly, should be:

Suggested change

```sql
USE "<repo>-metadata.<branch>.system";
SELECT * FROM object_metadata
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at
LIMIT 1;
```

```sql
USE "<repo>-metadata.<branch>.system";
SELECT * FROM object_metadata
WHERE commit_id = <head_commit> -- Replace with the head commit ID of the branch you are looking at
LIMIT 1;
```

Preview shows it is rendered correctly, what am I missing?

ozkatz

Overall looks great and I love the depth and examples. Things I believe should be considered:

* navigation (don't nest this doc too deeply!)
* consistency being a section and not a callout
* perhaps explaining how to use branches vs commits vs tags a little more clearly. I'm slightly confused atm so I imagine others might also be?

ozkatz · 2025-07-16T17:20:20Z

docs/src/understand/glossary.md

@@ -66,6 +66,16 @@ Where there is data, there is also metadata. lakeFS uses metadata to define sche
 ## Merge
 lakeFS merge command, similar to the Git merge functionality, allows you to merge data branches. Once you commit data, you can review it and then merge the committed data into the target branch. A merge generates a commit on the target branch with all your changes. lakeFS guarantees atomic merges that are fast, given they don’t involve copying data. [Read More][merge].

+## Object Metadata


ozkatz · 2025-07-16T17:21:12Z

docs/mkdocs.yml

@@ -143,6 +143,8 @@ nav:
    - Work with Data locally: howto/local-checkouts.md
    - Sizing Guide: howto/sizing-guide.md
  - Data Management:
+    - Metadata:
+        - Metadata search: datamanagment/metadata/metadata-search.md


Why the nesting? I'd put Metadata Search directly below Data Management

It was done with other upcoming metadata capabilities in mind, but I'm totally fine with flattening it. Done.

ozkatz · 2025-07-16T17:21:38Z

docs/src/datamanagment/metadata/metadata-search.md

+!!! info
+    Available in **lakeFS Enterprise**
+
+!!! tip


Suggested change

!!! tip

!!! note

ozkatz · 2025-07-16T17:25:28Z

docs/src/datamanagment/metadata/metadata-search.md

+lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog),
+nd uses catalog-level system tables to manage and expose versioned object metadata for querying.
+
+For every searchable repository and branch (see [configuration](#configuring-metadata-search) for more info), lakeFS 


This is a bit confusing: you say that it's for every repository and branch (and also show this in the next line as the convention, i.e. ..metadata.<branch>.system..), but later you say that commits and tags are also supported.

agreed, I changed the section to talk more about how it works and removed the part that talks about immutable references from this section. I'm keeping it to the writing reproducible queries section.

ozkatz · 2025-07-16T17:26:41Z

docs/src/datamanagment/metadata/metadata-search.md

+(See [Writing reproducible queries](#writing-reproducible-queries) for more on querying by different reference types.)
+
+!!! info
+    You can use Metadata Search even if you’re not licensed for full lakeFS Iceberg support.


Say I buy metadata search but not Iceberg REST - will I have a REST catalog that only allows reading metadata and not creating other tables in it?

It will be backed by a licensing restriction in the future. For now it doesn't work like that.

ozkatz · 2025-07-16T17:57:25Z

docs/src/datamanagment/metadata/metadata-search.md

+### Writing Reproducible Queries
+
+When you search metadata on a branch, the results reflect the state of the branch’s HEAD commit at query time, provided 
+the metadata has been ingested (eventual consistency applies). However, since a branch’s HEAD is mutable, it moves forward


if consistency is a section, you can link to it ;)

ozkatz · 2025-07-16T17:58:38Z

docs/src/datamanagment/metadata/metadata-search.md

+
+To search metadata associated with a specific tag:
+1. Retrieve the commit ID the tag points to.
+2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids-).


Suggested change

2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids-).

2. Use the commit-based pattern described in [Using Commit IDs](#using-commit-ids).

are you sure? intellij forces me use - as a prefix

ozkatz · 2025-07-16T18:01:18Z

docs/src/datamanagment/metadata/metadata-search.md

+
+```sql
+USE "repo-metadata.commit-dc3117ec3a727104226c896bf7ab9350ee5da06ae052406262840e9a4a8c9ffb.system";
+SHOW TABLES;


why is this needed?

it isn't, removed

ozkatz · 2025-07-16T18:02:57Z

docs/src/datamanagment/metadata/metadata-search.md

+## Limitations
+
+* Applies only to objects added or modified after the feature was enabled. Existing objects before that point are not indexed. 
+* No direct commit and tag support: To query by commit or tag, see the [writing reproducible queries](#writing-reproducible-queries)


What do you mean by "direct commit"? earlier you showed an example of querying by commit ID, no?

To query by commit ID, you need to add a `commit-` prefix to it. I changed this and will try to add clarity to the reproducible queries section.
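
The naming rule described in this reply can be sketched as a small helper. It assumes the `<repo>-metadata.<ref>.system.object_metadata` pattern and the `commit-` prefix seen in the snippets quoted in this thread. The function name and its parameters are hypothetical.

```python
def object_metadata_table(repo: str, ref: str, is_commit: bool = False) -> str:
    """Build the identifier of the object metadata table for a reference.

    Branch names are used as-is; commit IDs get the `commit-` prefix.
    """
    namespace_ref = f"commit-{ref}" if is_commit else ref
    return f"{repo}-metadata.{namespace_ref}.system.object_metadata"


# Branch-based identifier: follows the branch HEAD, so results can change over time.
print(object_metadata_table("repo", "main"))

# Commit-based identifier: pinned to one commit, so results are reproducible.
commit_id = "dc3117ec3a727104226c896bf7ab9350ee5da06ae052406262840e9a4a8c9ffb"
print(object_metadata_table("repo", commit_id, is_commit=True))
```

For a tag, the commit ID it points to would be resolved first and then passed with `is_commit=True`.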

ozkatz · 2025-07-16T18:03:17Z

docs/src/assets/img/mds/mds_how_it_works.png

Great illustration

nopcoder

Looks good. I think the part I'm missing is permissions/auth: explain which lakeFS permissions the user needs, and on which repository. Also mention that the client performing the query needs access to the underlying storage under the lakeFS repository's storage namespace.

nopcoder · 2025-07-16T12:39:05Z

docs/src/assets/img/mds/mds_how_it_works.png

will it be possible to add this one as SVG or JPEG? as png it will take a very large size.

done, changed to svg

nopcoder · 2025-07-16T13:06:31Z

docs/src/datamanagment/metadata/metadata-search.md

+* **User-defined metadata**: Custom labels, annotations, or tags stored as lakeFS object metadata — typically added during
+ingestion, processing, or curation.
+
+![metadata search](../../assets/img/mds/mds_how_it_works.png)


Need move the image to the How it works section later?

I also had this dilemma, but my intent here is to use a higher-level diagram that describes the feature and, for simplicity, is not 100% accurate. For example, the queried repo name is incorrect, and so are the table schema and records. WDYT?

nopcoder · 2025-07-16T17:25:01Z

docs/src/datamanagment/metadata/metadata-search.md

+## Benefits
+
+* **Scalable**: Search metadata across millions or billions of objects.
+* **Query Reproducibility**: Run metadata queries against specific commits or tags for consistent results.
+* **No infrastructure burden**: lakeFS manages metadata collection and indexing natively: no need to build, deploy or 
+maintain a separate metadata tracking system.


These are lakeFS specific implementation benefits? when I read it first I thought these are 'metadata search' benefits, but a lot of them are listed as use-cases, like data discoverability.

The distinction between benefits and use cases is that benefits describe the value the feature provides across scenarios. For metadata search, this includes scalability, reproducibility, and automation. Use cases explain how the feature is applied in practice, along with the impact on each workflow. The same benefits apply across all listed use cases. Makes sense?

nopcoder · 2025-07-16T17:29:54Z

docs/src/datamanagment/metadata/metadata-search.md

+
+!!! info
+    You can use Metadata Search even if you’re not licensed for full lakeFS Iceberg support.
+    If you're already using another Iceberg REST catalog, you don’t need to switch — metadata search will still work using 


Don't understand this one. Metadata search can use other Iceberg REST catalog?

What I meant to say (and is probably not clear enough) is that if you are already using Iceberg you don't need to use lakeFS with your self created Iceberg tables.

From the first part "If you're already using another Iceberg REST catalog, you don’t need to switch — ..." it sounds like we try to explain the user doesn't have to switch and we can use its catalog.
Maybe it is the word "switch" that bothers me.

nopcoder · 2025-07-16T17:32:38Z

docs/src/datamanagment/metadata/metadata-search.md

+
+## How it Works
+
+lakeFS Metadata Search is built on top of [lakeFS Iceberg support](../../integrations/iceberg.md#what-is-lakefs-iceberg-rest-catalog),


use the lakeFS Iceberg REST catalog

nopcoder · 2025-07-16T17:41:24Z

docs/src/datamanagment/metadata/metadata-search.md

+TODO
+* metadata server configurations
+  * lakeFS server
+  * Searchable repos and branches
+* Iceberg catalog configurations?


Don't forget to remove/update

nopcoder · 2025-07-16T17:48:00Z

docs/src/datamanagment/metadata/metadata-search.md

+
+To search by object metadata in lakeFS, you query the Iceberg object metadata tables lakeFS creates and manages.
+
+Object metadata tables are standard Iceberg tables, meaning, you can query it using any Iceberg-compatible engine, including


Object metadata is stored in Iceberg table

nopcoder · 2025-07-16T17:48:16Z

docs/src/datamanagment/metadata/metadata-search.md

+To search by object metadata in lakeFS, you query the Iceberg object metadata tables lakeFS creates and manages.
+
+Object metadata tables are standard Iceberg tables, meaning, you can query it using any Iceberg-compatible engine, including
+DuckDB, Trino, Spark, PyIceberg, or else.


, or others.

nopcoder · 2025-07-16T20:13:15Z

docs/src/datamanagment/metadata/metadata-search.md

+2. Load the object metadata table that represents the reference you would like to query.
+3. Use SQL to search by system or user-defined metadata.
+
+Here’s an example using PyIceberg and DuckDB: 


note that the code doesn't import duckdb, I assume it requires the duckdb package to be installed for this integration. consider note about the packages to install in order for the code to run or create one example with python code that uses pyiceberg and one that uses duckdb directly.

nopcoder · 2025-07-16T20:14:07Z

docs/src/datamanagment/metadata/metadata-search.md

+    'credential': f'AKIAlakefs12345EXAMPLE:abc/lakefs/1234567bPxRfiCYEXAMPLEKEY',
+})
+
+con = catalog.load_table('repo-metadata.main.system.object_metadata').scan().to_duckdb('object_metadata')


Add a comment that 'repo' stands for the name of the repository we would like to search.
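
Taken together, the two comments above (missing imports, unexplained `repo`) suggest an example along the following lines. This is a sketch under assumptions, not the snippet from the PR: the helper names are hypothetical, the catalog URI is left as a parameter because the thread does not show it, and the code assumes `pyiceberg` is installed with its `pyarrow` and `duckdb` extras. The `key:secret` credential format and the table name pattern follow the quoted diff.

```python
def catalog_properties(uri: str, access_key_id: str, secret_access_key: str) -> dict:
    """Connection properties for a REST catalog; the credential is 'key:secret'."""
    return {
        "type": "rest",
        "uri": uri,
        "credential": f"{access_key_id}:{secret_access_key}",
    }


def open_object_metadata(repo: str, branch: str, properties: dict):
    """Load a branch's object metadata table into DuckDB and return the connection.

    `repo` is the name of the lakeFS repository to search.
    """
    # Imported lazily so the module can be loaded without the optional packages.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("lakefs", **properties)
    table = catalog.load_table(f"{repo}-metadata.{branch}.system.object_metadata")
    # to_duckdb() registers the scan result as a DuckDB table named 'object_metadata'.
    return table.scan().to_duckdb("object_metadata")
```

Calling `open_object_metadata("my-repo", "main", props)` against a reachable catalog returns a DuckDB connection, on which a query such as `con.sql("SELECT * FROM object_metadata LIMIT 1")` can then be run.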

ozkatz

Looks good! The only thing I'm worried about is the structure of the config file - approving because this is orthogonal to merging the docs :)

ozkatz · 2025-07-17T18:46:05Z

docs/src/datamanagment/metadata-search.md

+      max_commits: 100
+      repositories:
+        "example-repo-1":
+          - "main"


This is not docs related - it's a design comment which I feel strongly about:
The hierarchy here is too implicit and will not support tags and arbitrary commits.
I suggest:

    ...
    repositories:
      example-repo-1:
        branches:
          - main
          - dev

This would later allow adding tags or commits while being backwards compatible.

Thanks, will follow up on this one asap.
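
The backward-compatibility argument can be sketched in Python. This only illustrates the proposed shape and is not the actual lakeFS configuration loader: `normalize_repositories` is a hypothetical helper that accepts both the current form (a repository mapped to a list of branches) and the suggested form (a repository mapped to a `branches` key), leaving room for future keys such as tags or commits.

```python
def normalize_repositories(repositories: dict) -> dict:
    """Return every repository entry in the nested {'branches': [...]} form."""
    normalized = {}
    for repo, value in repositories.items():
        if isinstance(value, list):
            # Current implicit form: the list is understood to hold branch names.
            normalized[repo] = {"branches": list(value)}
        else:
            # Suggested explicit form: copy as-is, defaulting to no branches.
            entry = dict(value)
            entry.setdefault("branches", [])
            normalized[repo] = entry
    return normalized


current = {"example-repo-1": ["main"]}
suggested = {"example-repo-1": {"branches": ["main", "dev"]}}
print(normalize_repositories(current))
print(normalize_repositories(suggested))
```

Because the explicit form is a mapping, a later key next to `branches` would not change how existing entries are read.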

ozkatz · 2025-07-17T18:46:56Z

docs/src/datamanagment/metadata-search.md

+Here’s an example using PyIceberg and DuckDB:
+
+!!! requirements 
+    This requires duckdb to be installed. 


maybe refer to their docs/installation guide

ozkatz · 2025-07-17T18:47:39Z

docs/src/datamanagment/metadata-search.md

+
+Querying metadata tables using a branch name, e.g., `repo-metadata.main.system.object_metadata` return results based on
+the state of the branch’s HEAD commit at the time of the query, assuming the metadata has already been ingested (within 
+[eventual consistency](#consistency-) constraints). However, because branch heads are mutable and advance with each new


Suggested change

[eventual consistency](#consistency-) constraints). However, because branch heads are mutable and advance with each new

[eventual consistency](#consistency) constraints). However, because branch heads are mutable and advance with each new

talSofer added 14 commits July 3, 2025 11:01

plumbing for metadata search docs

7d2ff17

relocate docs to new layout dir structure

b3499a7

partial docs

36d4e39

how it works

e94ee28

partial

057f58a

another part

ae3e698

another part

5907eab

how to use refrences

965b66a

how to search

a4bedb5

example queries

92d2e13

corrections

152c070

add object metadata to glossary

2ceecd5

corrections

28902d7

move to metadata directory

70bb1b2

talSofer requested review from nopcoder, ozkatz, AliRamberg and guy-har July 15, 2025 09:05

talSofer added the exclude-changelog (PR description should not be included in next release changelog) and docs (Improvements or additions to documentation) labels Jul 15, 2025

talSofer added 2 commits July 15, 2025 12:09

fix menu display

3d01d64

fix bullet rendering

e5dc2d8

guy-har approved these changes Jul 16, 2025


AliRamberg reviewed Jul 16, 2025


ozkatz requested changes Jul 16, 2025


nopcoder requested changes Jul 16, 2025


talSofer added 3 commits July 17, 2025 09:55

config

05f622b

config

bc52cd1

link to thank you page

2b5d384

talSofer added 4 commits July 17, 2025 10:47

fix example formatting

caeea5b

fix example formatting

d02c32d

fix config section formatting

251584c

address most review comments

73e9353

talSofer marked this pull request as ready for review July 17, 2025 09:47

talSofer added 4 commits July 17, 2025 13:00

png to svg

7ff13f9

fix object metadata term formatting

8faa21e

change img

1dc0d4f

fix review comments

f596f68

talSofer requested review from nopcoder and ozkatz July 17, 2025 13:00

talSofer added 2 commits July 17, 2025 16:13

simplify example query

aa4464c

revised how it works

9a31fa6

talSofer added the minor-change (Used for PRs that don't require issue attached) label Jul 17, 2025

clarify how to use commit ids

297ba12

ozkatz approved these changes Jul 17, 2025


fix final cr comments

b7a5de8

nopcoder approved these changes Jul 18, 2025


talSofer merged commit 2cb42bc into master Jul 18, 2025
42 checks passed

talSofer deleted the docs/metadata-search branch July 18, 2025 11:50

