Open
73 commits
5fd4443
:bug: remove code_version constructor argument from FeatureSpec
danielgafni Oct 30, 2025
ef1c602
fix: it looks like this field was missing/went missing
geoHeil Oct 31, 2025
9db71af
fix: refine implementation
geoHeil Oct 30, 2025
e477693
fix: type
geoHeil Oct 30, 2025
1a4d02e
chore: cleanup
geoHeil Oct 31, 2025
646f2ba
chore: cleanup
geoHeil Oct 31, 2025
0dcc8d3
fix: update snapshots after merge
geoHeil Oct 31, 2025
ac6db53
chore: cleanup
geoHeil Oct 31, 2025
905cfa6
chore: slides
geoHeil Oct 31, 2025
fe56b40
Update feature_spec.py
geoHeil Oct 31, 2025
c4e16f2
chore: rename
geoHeil Oct 31, 2025
20f58f9
fix: lint
geoHeil Oct 31, 2025
aefa0c0
chore: cleanup
geoHeil Oct 31, 2025
ba3117c
fix: refine implementation
geoHeil Oct 30, 2025
8d5862d
fix: type
geoHeil Oct 30, 2025
6f224bf
chore: cleanup
geoHeil Oct 31, 2025
41f0331
chore: cleanup
geoHeil Oct 31, 2025
baf3b9d
feat: #66 Add metadata parameter to FeatureSpec for user-defined info…
geoHeil Oct 29, 2025
724e1fa
fix: refine snapshots again with pre-commit hook
geoHeil Oct 29, 2025
6adb1c6
fix: snapshot newlines
geoHeil Oct 29, 2025
1280ef0
fix: claude
geoHeil Oct 30, 2025
8a68d6c
fix: upgrade CI
geoHeil Oct 30, 2025
0e3d6e7
fix: refine impl
geoHeil Oct 30, 2025
1bfb6fe
fix: lint
geoHeil Oct 31, 2025
369b0dd
chore: savepoint
geoHeil Oct 31, 2025
d578edb
fix: cleanup after rebase
geoHeil Oct 31, 2025
e2dbd3e
chore: rename, fix
geoHeil Oct 31, 2025
1e0b963
fix: tests
geoHeil Oct 31, 2025
858ac47
chore: refine
geoHeil Oct 31, 2025
855546a
chore: remove example
geoHeil Oct 31, 2025
7861e09
chore :refine docs
geoHeil Oct 31, 2025
c40b928
chore: remove diff
geoHeil Oct 31, 2025
75f5d39
fix: cleanup
geoHeil Oct 31, 2025
cd25a2f
fix: address reviewers comments
geoHeil Oct 31, 2025
8aa3a9d
fix: remove frozen dict
geoHeil Oct 31, 2025
7695224
fix: ensure FrozenBaseModel is used
geoHeil Oct 31, 2025
847c137
fix: refine
geoHeil Oct 31, 2025
128fffd
chore: k
geoHeil Oct 31, 2025
344d7a6
fix: update tests
geoHeil Oct 31, 2025
bab86a2
fix: refine implementation
geoHeil Oct 30, 2025
e18753e
chore: cleanup
geoHeil Oct 31, 2025
7ab3718
fix: refine implementation
geoHeil Oct 30, 2025
2e5b22e
chore: cleanup
geoHeil Oct 31, 2025
c2d48f4
feat: #66 Add metadata parameter to FeatureSpec for user-defined info…
geoHeil Oct 29, 2025
749120b
fix: refine snapshots again with pre-commit hook
geoHeil Oct 29, 2025
0ed2610
fix: snapshot newlines
geoHeil Oct 29, 2025
8dd43d8
fix: refine impl
geoHeil Oct 30, 2025
36a2b2e
fix: cleanup after rebase
geoHeil Oct 31, 2025
824682f
chore: remove diff
geoHeil Oct 31, 2025
ad65a96
fix: address reviewers comments
geoHeil Oct 31, 2025
b51c948
fix: ensure FrozenBaseModel is used
geoHeil Oct 31, 2025
19afc78
fix: refine implementation
geoHeil Oct 30, 2025
925d715
feat: re-enable type checker
geoHeil Oct 31, 2025
df6523e
chore: reduce permissions
geoHeil Oct 31, 2025
0a2ab79
fix: cleanup
geoHeil Oct 31, 2025
4d5ef8b
fix: ts?
geoHeil Oct 31, 2025
3de5948
fix: permissions
geoHeil Oct 31, 2025
e88bac1
fix: permissions
geoHeil Oct 31, 2025
64e360a
fix: ?
geoHeil Oct 31, 2025
5e19916
fix: disable cache one more
geoHeil Oct 31, 2025
976a391
fix: swap order
geoHeil Oct 31, 2025
e97295f
fix: simplify further; still nix challenges
geoHeil Oct 31, 2025
a29eaab
fix: revert for test
geoHeil Oct 31, 2025
8dd2ce8
fix: different glibc nix
geoHeil Oct 31, 2025
139658d
fix: k
geoHeil Oct 31, 2025
aec40b4
fix: non-interactive
geoHeil Oct 31, 2025
f84414f
fix: use default shell
geoHeil Oct 31, 2025
18cfe6b
fix: use basic
geoHeil Oct 31, 2025
c6d9fd3
fix: k
geoHeil Oct 31, 2025
1831819
fix: k
geoHeil Oct 31, 2025
7bf59d6
fix: k
geoHeil Oct 31, 2025
4b5bab4
fix: k
geoHeil Oct 31, 2025
f5d9e1a
fix: k
geoHeil Oct 31, 2025
73 changes: 39 additions & 34 deletions .github/workflows/QA.yml
@@ -37,7 +37,7 @@ jobs:
runs-on: depot-ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@v5
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
@@ -49,7 +49,7 @@ jobs:
runs-on: depot-ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@v5
- name: Install uv
uses: astral-sh/setup-uv@v7
with:
@@ -58,36 +58,39 @@ jobs:
with:
dprint-version: 0.50.2

# typecheck:
# runs-on: depot-ubuntu-latest
# env:
# UV_PYTHON_PREFERENCE: only-system
# steps:
# - name: Checkout code
# uses: actions/checkout@v4
# - name: Install Nix
# uses: cachix/install-nix-action@v27
# with:
# nix_path: nixpkgs=channel:nixpkgs-unstable
# - name: Setup Magic Nix Cache
# uses: DeterminateSystems/magic-nix-cache-action@v8
# - uses: nicknovitski/nix-develop@v1
# - name: Sync dependencies
# run: uv python pin 3.10 && uv sync --all-extras --all-groups
# - name: Replace bundled Node.js with Nix Node.js
# run: |
# # Find the bundled node binary and replace it with Nix's node
# BUNDLED_NODE=$(find .venv/lib/python3.10/site-packages/nodejs_wheel/bin -name "node" -type f 2>/dev/null || true)
# if [ -n "$BUNDLED_NODE" ]; then
# rm -f "$BUNDLED_NODE"
# ln -s "$(which node)" "$BUNDLED_NODE"
# echo "Replaced bundled node with Nix node: $(which node)"
# fi
# - name: Run Basedpyright
# run: uv run basedpyright --level error
# - name: Cleanup nix environment
# if: always()
# run: bash .github/scripts/cleanup-nix-env.sh
typecheck:
runs-on: depot-ubuntu-latest
env:
UV_PYTHON_PREFERENCE: only-system
permissions:
contents: read
actions: write
steps:
- name: Checkout code
uses: actions/checkout@v4
- name: Install Nix
uses: cachix/install-nix-action@v31
with:
install_url: https://install.determinate.systems/nix
install_options: "--no-confirm"
extra_nix_config: |
extra-experimental-features = nix-command flakes
#nix_path: nixpkgs=channel:nixpkgs-unstable
- name: Setup Magic Nix Cache
uses: DeterminateSystems/magic-nix-cache-action@v8

- name: Run Typecheck in Nix Shell
run: |
nix develop . -c bash <<'EOF'
# These commands run inside the temporary Nix shell
echo "--- Pinning Python and syncing dependencies ---"
uv python pin 3.10
uv sync --all-extras --all-groups

echo "--- Running Basedpyright ---"
uv run basedpyright --level error
EOF


test:
needs: filter
@@ -109,7 +112,7 @@ jobs:
- name: Save original environment
run: bash .github/scripts/save-env.sh
- name: Install Nix
uses: cachix/install-nix-action@v27
uses: cachix/install-nix-action@v31
with:
nix_path: nixpkgs=channel:nixpkgs-unstable
- name: Setup Magic Nix Cache
@@ -180,7 +183,7 @@ jobs:
- name: Save original environment
run: bash .github/scripts/save-env.sh
- name: Install Nix
uses: cachix/install-nix-action@v27
uses: cachix/install-nix-action@v31
with:
nix_path: nixpkgs=channel:nixpkgs-unstable
- name: Setup Magic Nix Cache
@@ -199,6 +202,8 @@ jobs:
check:
if: always()
runs-on: depot-ubuntu-latest
permissions:
contents: read
needs:
- test
- lint
5 changes: 5 additions & 0 deletions CLAUDE.md
@@ -270,6 +270,11 @@ Features can override `load_input()` for custom join logic:

This is critical for migrations when upstream dependencies change.

#### Attaching Metadata to Features

Additional metadata (JSON) can be attached to features via the `metadata` parameter on `FeatureSpec`.
Use cases include data governance information such as ownership, SLAs, and PII flags.

## Important Constraints

### Narwhals as the Public Interface
4 changes: 4 additions & 0 deletions docs/index.md
@@ -106,6 +106,10 @@ When `Video` changes, Metaxy automatically identifies that `VoiceDetection` requ

Every feature definition produces a deterministic version hash computed from its dependencies, fields, and code versions. When you modify a feature—whether changing its dependencies, adding fields, or updating transformation logic, Metaxy detects the change and propagates it downstream. This is done on multiple levels: `Feature` (class) level, field (class attribute) level, and of course on row level: each _sample_ in the metadata store tracks the version of _each field_ and the overall (class-level) feature version.

### Code vs Feature Versions

`Feature.code_version` only looks at a feature's own fields. It hashes their keys and `code_version` values (in sorted order) and ignores the dependency graph entirely. Use it to answer _"did my feature's logic change?"_. In contrast, `Feature.feature_version()` includes both the local fields and every dependency, so it changes whenever parent features evolve. Checking both hashes lets you distinguish between local code updates and upstream changes.
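
The distinction can be sketched in a few lines of plain Python. This is only an illustration of the idea described above, not Metaxy's implementation: the helper names `code_version` and `feature_version`, the use of SHA-256, and the string encoding are all assumptions made for this example.

```python
import hashlib


def code_version(fields: dict[str, str]) -> str:
    """Hypothetical helper: hash only the feature's own field keys and
    their code versions, visiting the keys in sorted order."""
    h = hashlib.sha256()
    for key in sorted(fields):
        h.update(f"{key}={fields[key]};".encode())
    return h.hexdigest()


def feature_version(fields: dict[str, str], parent_versions: list[str]) -> str:
    """Hypothetical helper: combine the local code version with the
    versions of every parent feature."""
    h = hashlib.sha256()
    h.update(code_version(fields).encode())
    for parent in sorted(parent_versions):
        h.update(parent.encode())
    return h.hexdigest()


# A parent feature before and after a change to its `frames` field.
video_v1 = feature_version({"frames": "1", "audio": "1"}, [])
video_v2 = feature_version({"frames": "2", "audio": "1"}, [])

# The child's own fields are untouched in both cases.
transcript_fields = {"text": "1"}
local_hash = code_version(transcript_fields)
before = feature_version(transcript_fields, [video_v1])
after = feature_version(transcript_fields, [video_v2])

# The local hash is stable, while the feature-level hash moves with the parent.
print(local_hash == code_version(transcript_fields), before != after)
```

In this sketch, an unchanged local hash combined with a changed feature-level hash points to an upstream change rather than a local code update.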

This ensures that when feature definitions evolve, every feature that transitively depends on it can be systematically updated. Because Metaxy supports declaring dependencies on fields, it can identify when a feature _does not_ require recomputation, even if one of its parents has been changed (but only irrelevant fields did). This is a huge factor in improving efficiency and reducing unnecessary computations (and costs!).

Because Metaxy feature graphs are static, Metaxy can calculate data version changes ahead of the actual computation. This enables patterns such as **computation preview** and **computation cost prediction**.
64 changes: 47 additions & 17 deletions docs/learn/feature-definitions.md
@@ -1,23 +1,35 @@
# Feature System

Metaxy has a declarative (defined statically at class level), expressive, flexible feature system. It has been inspired by Software-Defined Assets in [Dagster](https://dagster.io/).
Metaxy has a declarative (defined statically at class level), expressive, flexible feature system.
It has been inspired by Software-Defined Assets in [Dagster](https://dagster.io/).

Features represent tabular **metadata**, typically containing references to external multi-modal **data** such as files, images, or videos. But it can be just pure **metadata** as well.
Features represent tabular **metadata**, typically containing references to external multi-modal **data** such as files, images, or videos.
But it can be just pure **metadata** as well.

I will highlight **data** and **metadata** with bold so it really stands out.

Metaxy is responsible for providing correct **metadata** to users. During incremental processing, Metaxy will automatically resolve added, changed and deleted **metadata** rows and calculate the right [sample versions](data-versioning.md) for them. Metaxy does not interact with **data** directly, the user is responsible for writing it, typically using **metadata** to identify sample locations in storage (it's a good idea to inject the sample version into the data sample identifier). Metaxy is designed to be used with systems that do not overwrite existing **metadata** (Metaxy only appends **metadata**) and therefore **data** as well (while we cannot enforce that since the user is responsible for writing the data, it's easily achievable by **including the sample version into the data sample identifier**).
Metaxy is responsible for providing correct **metadata** to users.
During incremental processing, Metaxy will automatically resolve added, changed and deleted **metadata** rows and calculate the right [sample versions](data-versioning.md) for them.
Metaxy does not interact with **data** directly, the user is responsible for writing it, typically using **metadata** to identify sample locations in storage (it's a good idea to inject the sample version into the data sample identifier).
Metaxy is designed to be used with systems that do not overwrite existing **metadata** (Metaxy only appends **metadata**) and therefore **data** as well (while we cannot enforce that since the user is responsible for writing the data, it's easily achievable by **including the sample version into the data sample identifier**).

I hope we can stop using bold for **data** and **metadata** from now on, hopefully we've made our point.

> [!tip] Include Sample Version In Your Data Path
> Include the sample version in your data path to ensure strong consistency guarantees. I mean it. Really do it!
> [!tip] Include sample version in your data path
> Include the sample version in your data path to ensure strong consistency guarantees.
> I mean it.
> Really do it!

Features live on a global `FeatureGraph` object (typically users do not need to interact with it directly). Features are bound to a specific Metaxy project, but can be moved between projects over time. Features must have unique (across all projects) `FeatureKey` associated with them.
Features live on a global `FeatureGraph` object (typically users do not need to interact with it directly).
Features are bound to a specific Metaxy project, but can be moved between projects over time.
Features must have unique (across all projects) `FeatureKey` associated with them.

## Feature Specs

Before we can define a `Feature`, we must first create a `FeatureSpec` object. But before we get to an example, it's necessary to understand the concept of ID columns. Metaxy must know how to uniquely identify feature samples and join metadata tables, therefore, you need to attach one or more ID columns to your `FeatureSpec`. Very often these ID columns would stay the same across many feature specs, therefore it makes a lot of sense to define them on a shared base class.
Before we can define a `Feature`, we must first create a `FeatureSpec` object.
But before we get to an example, it's necessary to understand the concept of ID columns.
Metaxy must know how to uniquely identify feature samples and join metadata tables, therefore, you need to attach one or more ID columns to your `FeatureSpec`.
Very often these ID columns would stay the same across many feature specs, therefore it makes a lot of sense to define them on a shared base class.

Some boilerplate with typing is involved (this is typically a good thing):

@@ -36,11 +48,17 @@ class VideoFeatureSpec(BaseFeatureSpec[VideoIds]):

`BaseFeatureSpec` is a [Pydantic](https://docs.pydantic.dev/latest/) model, so all normal Pydantic features apply.

Feature specs now support an optional `metadata` dictionary for attaching ownership, documentation, or tooling context to a feature.
This metadata **never** influences graph topology or version hashes, must be JSON-serializable, and is immutable once the spec is created.
It is ideal for values such as owners, SLAs, runbooks, or tags that external systems may want to inspect.
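
The three properties above (JSON-serializable, immutable, excluded from version hashes) can be illustrated with a small stand-in class. This is a sketch under stated assumptions, not Metaxy code: `SpecSketch`, its `version()` method, and the hashing scheme are hypothetical, and the real `FeatureSpec` is a Pydantic model whose internals may differ.

```python
import hashlib
import json
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Any, Mapping


@dataclass(frozen=True)
class SpecSketch:
    """Hypothetical stand-in for a feature spec carrying user metadata."""

    key: str
    code_version: str = "1"
    metadata: Mapping[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A JSON round-trip rejects non-serializable values (TypeError)
        # and copies the input, so later changes to the caller's dict
        # do not leak into the spec.
        copied = json.loads(json.dumps(self.metadata))
        # Expose a read-only view; note that nested containers inside
        # the view are still ordinary mutable objects in this sketch.
        object.__setattr__(self, "metadata", MappingProxyType(copied))

    def version(self) -> str:
        # Metadata is deliberately left out of the hash input.
        return hashlib.sha256(f"{self.key}:{self.code_version}".encode()).hexdigest()


plain = SpecSketch(key="/raw/video")
tagged = SpecSketch(
    key="/raw/video",
    metadata={"owner": "video-team", "sla_hours": 24, "pii": False},
)

# Attaching metadata leaves the version untouched.
print(plain.version() == tagged.version())
```

Attempting `tagged.metadata["owner"] = "someone-else"` raises `TypeError` in this sketch, which mirrors the immutability guarantee described above.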

With our `VideoFeatureSpec` in place, we can proceed to defining features that would be using it.

## Feature Definitions

Metaxy provides a `BaseFeature` class that can be extended to make user-defined features. It's a Pydantic model as well. User-defined `BaseFeature` classes must have fields matching ID columns of the `FeatureSpec` they are using.
Metaxy provides a `BaseFeature` class that can be extended to make user-defined features.
It's a Pydantic model as well.
User-defined `BaseFeature` classes must have fields matching ID columns of the `FeatureSpec` they are using.

With respect to the same DRY principle, we can define a shared base class for features that use the `VideoFeatureSpec`.

@@ -61,7 +79,9 @@ class VideoFeature(BaseVideoFeature, spec=VideoFeatureSpec(key="/raw/video")):
path: str
```

That's it! That's a root feature, it doesn't have any dependencies. Easy.
That's it!
That's a raw single feature, it doesn't have any dependencies.
Easy.

You may now use `VideoFeature.spec()` class method to access the original feature spec: it's bound to the class.

@@ -81,17 +101,23 @@ Hurray! You get the idea.

## Field-Level Dependencies

A core (I'll be straight: a killer) feature of Metaxy is the concept of **field-level dependencies**. These are used to define dependencies between logical fields of features.
A core (I'll be straight: a killer) feature of Metaxy is the concept of **field-level dependencies**.
These are used to define dependencies between logical fields of features.

A **field** is not to be confused with metadata _column_ (Pydantic fields). Fields are completely independent from them.
A **field** is not to be confused with metadata _column_ (Pydantic fields).
Fields are completely independent from them.

Columns refer to _metadata_ and are stored in metadata stores (such as databases) supported by Metaxy.

Fields refer to _data_ and are logical -- users are free to define them as they see fit. Fields are supposed to represent parts of data that users care about. For example, a `Video` feature -- an `.mp4` file -- may have `frames` and `audio` fields.
Fields refer to _data_ and are logical -- users are free to define them as they see fit.
Fields are supposed to represent parts of data that users care about.
For example, a `Video` feature -- an `.mp4` file -- may have `frames` and `audio` fields.

Downstream features can depend on specific fields of upstream features. This enables fine-grained control over data versioning, avoiding unnecessary reprocessing.
Downstream features can depend on specific fields of upstream features.
This enables fine-grained control over data versioning, avoiding unnecessary reprocessing.

At this point, careful readers have probably noticed that the `Transcript` feature from the [example](#feature-specs) above should not depend on the full video: it only needs the audio track in order to generate the transcript. Let's express that with Metaxy:
At this point, careful readers have probably noticed that the `Transcript` feature from the [example](#feature-specs) above should not depend on the full video: it only needs the audio track in order to generate the transcript.
Let's express that with Metaxy:

```py
from metaxy import FieldDep, FieldSpec
@@ -114,13 +140,16 @@ The [Data Versioning](data-versioning.md) docs explain more about this system.

### Fully Qualified Field Key

A **fully qualified field key (FQFK)** is an identifier that uniquely identifies a field within the whole feature graph. It consists of the **feature key** and the **field key**, separated by a colon, for example: `/raw/video:frames`, `/raw/video:audio/english`.
A **fully qualified field key (FQFK)** is an identifier that uniquely identifies a field within the whole feature graph.
It consists of the **feature key** and the **field key**, separated by a colon, for example: `/raw/video:frames`, `/raw/video:audio/english`.

## A Note on Type Coercion for Metaxy types

Internally, Metaxy uses strongly typed Pydantic models to represent feature keys, their fields, and the dependencies between them.

To avoid boilerplate, Metaxy also has syntactic sugar for construction of these classes. Different ways to provide them are automatically coerced into canonical internal models. This is fully typed and only affects **constructor arguments**, so accessing **attributes** on Metaxy models will always return only the canonical types.
To avoid boilerplate, Metaxy also has syntactic sugar for construction of these classes.
Different ways to provide them are automatically coerced into canonical internal models.
This is fully typed and only affects **constructor arguments**, so accessing **attributes** on Metaxy models will always return only the canonical types.

Some examples:

@@ -133,7 +162,8 @@ key = FeatureKey("prefix", "feature")
same_key = FeatureKey(key)
```

Metaxy really loves you, the user! See [syntactic sugar](#syntactic-sugar) for more details.
Metaxy really loves you, the user!
See [syntactic sugar](#syntactic-sugar) for more details.

## Syntactic Sugar

1 change: 1 addition & 0 deletions pyproject.toml
@@ -9,6 +9,7 @@ authors = [
requires-python = ">=3.10"
dependencies = [
"cyclopts==4.0.0b1",
"frozendict>=2.4.4",
"narwhals>=2.9.0",
"polars>=1.33.1",
"polars-hash>=0.5.1",