CLAUDE.md

Guidance for Claude Code working in this repository.

What it is

Java 25 native implementation of the Vortex columnar file format. Uses FFM (MemorySegment/Arena) — never JNI or sun.misc.Unsafe.

Module structure

core    — DType, PType, VortexException, VortexFormat + generated fbs/proto
          encoding: EncodingId, TimeUnit, PTypeIO   extension: ExtensionId
reader  — VortexReader, VortexHttpReader, VortexHandle, ReadRegistry, ExtensionDecoder,
          Chunk, ArrayStats, ScanOptions, RowFilter; file internals (Footer, Layout, Trailer,
          PostscriptParser, …)
          reader.array  — Array + all subtypes (decode outputs)
          reader.decode — EncodingDecoder, DecodeContext, ArrayNode + *EncodingDecoder impls
          reader.extension — Date/Time/Timestamp/Uuid ExtensionDecoder
writer  — VortexWriter, WriteRegistry, WriteOptions, ExtensionEncoder
          writer.encode — EncodingEncoder, EncodeContext, NullableData + *EncodingEncoder impls,
          extension encoders

Dependency rule: writer → core, reader → core. Writer never depends on reader. Array and subtypes are decode outputs — they live in reader.array, not core.

Branching

Trunk-based. PRs fine but always squash or rebase — no merge commits. Keep commits small, main always green.

Commands

Never mvn install / ./mvnw install. Normal builds need no external tools; generated fbs/proto sources are committed under core/src/main/java.

./mvnw verify                              # build all
./mvnw verify -DskipTests                  # build, no tests
./mvnw test                                # unit only (excludes *IntegrationTest)
./mvnw test -pl reader                     # one module
./mvnw test -pl reader -Dtest=MyTest       # one class
./mvnw test -pl reader -Dtest=MyTest#m     # one method
./mvnw verify -pl integration -am          # integration (failsafe, NOT surefire)
./mvnw verify -pl integration -am -Dit.test="RustWritesJavaReadsIntegrationTest#method"
./bench RustVsJavaReadBenchmark.javaReadVolume   # benchmark — always ClassName.methodName filter

Regenerate after editing .fbs/.proto:

brew install flatbuffers              # only for .fbs edits (any flatc version; guard auto-stripped)
./mvnw compile -pl proto-gen          # only on .proto edits
./mvnw generate-sources -pl core -P regenerate-sources   # then commit

flatc runs whenever the profile is active; if you only changed .proto, revert spurious fbs/ diffs: git checkout -- core/src/main/java/io/github/dfa1/vortex/fbs/. Proto-to-Java is in-process via proto-gen (no protoc/protobuf-java): one record per message with static decode(MemorySegment, long, long) + encode() operating directly on a segment.
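
To make the record-per-message shape concrete, here is a minimal sketch of a hand-rolled proto3 record. SketchMetadata and its two int64 fields are hypothetical, not a generated class from this repository; the sketch handles only varint fields, and its encode() builds a byte[] through a stream for brevity, which may differ from what proto-gen emits.

```java
import java.io.ByteArrayOutputStream;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

/** Hypothetical message: { int64 base = 1; int64 multiplier = 2; } */
record SketchMetadata(long base, long multiplier) {

    /** Decodes the message stored in seg[offset, offset + length). */
    static SketchMetadata decode(MemorySegment seg, long offset, long length) {
        long[] cursor = {offset};
        long end = offset + length;
        long base = 0;
        long multiplier = 0;
        while (cursor[0] < end) {
            long tag = readVarint(seg, cursor, end);
            int field = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            if (wireType != 0) { // assumption: this sketch supports varint fields only
                throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
            long value = readVarint(seg, cursor, end);
            switch (field) {
                case 1 -> base = value;
                case 2 -> multiplier = value;
                default -> { } // unknown varint field: ignored
            }
        }
        return new SketchMetadata(base, multiplier);
    }

    /** Encodes to proto3 wire bytes; fields equal to 0 are omitted. */
    byte[] encode() {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeField(out, 1, base);
        writeField(out, 2, multiplier);
        return out.toByteArray();
    }

    private static long readVarint(MemorySegment seg, long[] cursor, long end) {
        long result = 0;
        for (int shift = 0; shift < 64; shift += 7) {
            if (cursor[0] >= end) {
                throw new IllegalArgumentException("truncated varint");
            }
            byte b = seg.get(ValueLayout.JAVA_BYTE, cursor[0]++);
            result |= (long) (b & 0x7F) << shift;
            if (b >= 0) { // high bit clear: last byte of the varint
                return result;
            }
        }
        throw new IllegalArgumentException("varint too long");
    }

    private static void writeField(ByteArrayOutputStream out, int field, long value) {
        if (value == 0) {
            return;
        }
        writeVarint(out, (long) field << 3); // wire type 0 (varint)
        writeVarint(out, value);
    }

    private static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }
}
```

Reading straight from the segment avoids copying the message bytes onto the heap before parsing.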

Mutation testing

Opt-in PIT profile in core and reader (-P pitest), bound to the verify phase and scoped to the bounds/parse classes via <targetClasses> in each module POM. Used to harden the security-critical bounds guards (ADR 0003 Phase E).

./mvnw -pl reader -am -P pitest verify -DskipITs   # reader run (-am builds core; -DskipITs skips ITs)
./mvnw -pl core -P pitest verify                   # core run (IoBounds)

Report: <module>/target/pit-reports/index.html (+ mutations.xml for scripting). Widen a run by adding <param> entries under <targetClasses> in the module's pitest profile.

Do not invoke the goal directly (org.pitest:pitest-maven:mutationCoverage) — it resolves the latest plugin without the JUnit 5 engine and ignores the profile; always go through -P pitest.

Read survivors as a simplify-first signal, not only a test-gap signal: an equivalent mutant often marks a clause that can never change the outcome (dead code) — delete it rather than writing an unkillable test. Only add a test when the mutated bound is a genuine, independent edge.

Releasing

./mvnw --batch-mode release:clean release:prepare \
    -DreleaseVersion=<version> -DdevelopmentVersion=<next>-SNAPSHOT
git push && git push --tags          # GitHub Actions deploys the tag to Maven Central

File format

8-byte trailer at EOF: version(u16 LE) | postscriptLen(u16 LE) | magic(VTXF). The postscript (FlatBuffer, immediately before the trailer) points (offset+length) to the Footer (FlatBuffer), DType (Protobuf), and Layout (FlatBuffer) blobs elsewhere in the file.
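
The trailer layout above can be sketched as follows. TrailerSketch is a hypothetical stand-in, not the project's Trailer class, and the extra bounds check on postscriptLen is an assumption about sensible validation rather than documented behavior.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for the project's Trailer; names are illustrative. */
record TrailerSketch(int version, int postscriptLen) {

    static final long TRAILER_BYTES = 8;
    private static final ValueLayout.OfShort LE_U16 =
            ValueLayout.JAVA_SHORT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    /** Parses the last 8 bytes of the file: version | postscriptLen | "VTXF". */
    static TrailerSketch parse(MemorySegment file) {
        long size = file.byteSize();
        if (size < TRAILER_BYTES) {
            throw new IllegalArgumentException("file too short for trailer: " + size);
        }
        long base = size - TRAILER_BYTES;
        int version = Short.toUnsignedInt(file.get(LE_U16, base));
        int postscriptLen = Short.toUnsignedInt(file.get(LE_U16, base + 2));
        byte[] magic = file.asSlice(base + 4, 4).toArray(ValueLayout.JAVA_BYTE);
        if (!"VTXF".equals(new String(magic, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("bad magic");
        }
        if (postscriptLen > base) { // assumption: reject lengths that run past the start of the file
            throw new IllegalArgumentException("postscript length exceeds file");
        }
        return new TrailerSketch(version, postscriptLen);
    }

    /** Offset of the postscript, which sits immediately before the trailer. */
    long postscriptOffset(long fileSize) {
        return fileSize - TRAILER_BYTES - postscriptLen;
    }
}
```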

Layout tree: Struct → Zoned(Stats) → Chunked → [Flat, Flat, ...]

Flat: single encoded segment · Chunked: sequence of Flats · Struct: one child per column
Zoned (vortex.stats): wraps a child with per-chunk min/max for zone-map pruning
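
As an illustration of zone-map pruning, the sketch below keeps only the chunks whose min/max range can overlap a filter range. Zone and chunksToScan are hypothetical names, and the long-typed inclusive range is an assumption; the real RowFilter and ArrayStats types are not shown in this file.

```java
import java.util.ArrayList;
import java.util.List;

final class ZonePruningSketch {

    /** Per-chunk statistics as a zoned layout might expose them (hypothetical shape). */
    record Zone(long min, long max) {}

    /** Returns the indices of chunks whose [min, max] can contain a value in [lo, hi]. */
    static List<Integer> chunksToScan(List<Zone> zones, long lo, long hi) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < zones.size(); i++) {
            Zone z = zones.get(i);
            if (z.max() >= lo && z.min() <= hi) { // ranges overlap: the chunk must be decoded
                keep.add(i);
            }
        }
        return keep;
    }
}
```

With zones [0, 9], [10, 19], [20, 29] and a filter of [12, 21], only chunks 1 and 2 are scanned; chunk 0 is skipped without decoding.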

Encoding IDs are strings ("vortex.flat", "fastlanes.bitpacked"). ReadRegistry maps IDs → EncodingDecoder via ServiceLoader; immutable after construction — register custom decoders on the builder: ReadRegistry.builder().registerServiceLoaded().register(myDecoder).build().
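
The builder-then-immutable shape can be sketched in isolation. RegistrySketch and its Decoder interface are hypothetical stand-ins for ReadRegistry and EncodingDecoder, and rejecting duplicate IDs is an assumption, not documented behavior.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in; not the project's ReadRegistry. */
final class RegistrySketch {

    /** Hypothetical decoder contract, reduced to the string ID it handles. */
    interface Decoder {
        String encodingId();
    }

    private final Map<String, Decoder> byId;

    private RegistrySketch(Map<String, Decoder> byId) {
        this.byId = Map.copyOf(byId); // immutable snapshot: no registration after build()
    }

    Optional<Decoder> lookup(String id) {
        return Optional.ofNullable(byId.get(id));
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private final Map<String, Decoder> pending = new HashMap<>();

        Builder register(Decoder decoder) {
            // Assumption: a duplicate ID is a configuration error, so fail fast.
            if (pending.putIfAbsent(decoder.encodingId(), decoder) != null) {
                throw new IllegalStateException("duplicate encoding id: " + decoder.encodingId());
            }
            return this;
        }

        RegistrySketch build() {
            return new RegistrySketch(pending);
        }
    }
}
```

Because the map is copied at build time, readers can share the registry across threads without synchronization.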

Adding an encoding

Add EncodingId enum constant VORTEX_FOO("vortex.foo"), then per side:

Decode: FooEncodingDecoder implements EncodingDecoder in reader.decode + FQN in reader/.../META-INF/services/io.github.dfa1.vortex.reader.decode.EncodingDecoder
Encode: FooEncodingEncoder implements EncodingEncoder in writer.encode + FQN in writer/.../META-INF/services/io.github.dfa1.vortex.writer.encode.EncodingEncoder

Adding an extension type

Add ExtensionId constant, then per side:

Decode: FooExtensionDecoder implements ExtensionDecoder in reader.extension; register via ReadRegistry.builder().register(new FooExtensionDecoder()) — no service file (registerServiceLoaded() does not discover ExtensionDecoder).
Encode: FooExtensionEncoder implements ExtensionEncoder in writer + FQN in writer/.../META-INF/services/io.github.dfa1.vortex.writer.ExtensionEncoder

Memory model

VortexReader memory-maps the whole file into one confined-Arena MemorySegment. All Array buffers returned during scan are zero-copy slices of it — lifetime tied to the reader; close to release.
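
A minimal sketch of that lifetime model, using only JDK APIs. MappedFileSketch is a hypothetical stand-in and not VortexReader; it shows that a slice is a view into the mapping and becomes inaccessible once the owning arena is closed.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical reader that owns one mapping; not VortexReader itself. */
final class MappedFileSketch implements AutoCloseable {

    private final Arena arena = Arena.ofConfined();
    private final MemorySegment file;

    MappedFileSketch(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            // The mapping outlives the channel; its lifetime is the arena's.
            file = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        } catch (IOException | RuntimeException e) {
            arena.close(); // do not leak the arena if opening or mapping fails
            throw e;
        }
    }

    /** Zero-copy view into the mapping; valid only until close(). */
    MemorySegment slice(long offset, long length) {
        return file.asSlice(offset, length);
    }

    @Override
    public void close() {
        arena.close(); // unmaps the file and invalidates every slice handed out
    }
}
```

Reading from a slice after close() throws IllegalStateException, which is why buffers must not outlive the reader.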

Allocation rule — never new byte[] + MemorySegment.ofArray() for decode output. Always ctx.arena().allocate(...) (off-heap, zero GC, scan-chunk lifetime). If a private helper lacks DecodeContext, pass an Arena arena param from the decode() call site.

// WRONG: heap alloc, GC pressure, extra copy
MemorySegment out = MemorySegment.ofArray(new byte[(int) (n * elemBytes)]);
// CORRECT
MemorySegment out = ctx.arena().allocate(n * elemBytes);
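
For a private helper that has no DecodeContext, the rule above amounts to threading the arena through as a parameter. A minimal sketch, assuming a hypothetical widening helper (WidenSketch and widenI32ToI64 are illustrative names, not project code):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;

final class WidenSketch {

    static final ValueLayout.OfInt LE_INT =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);
    static final ValueLayout.OfLong LE_LONG =
            ValueLayout.JAVA_LONG_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    /** Widens n little-endian i32 values to i64, writing into arena-owned off-heap memory. */
    static MemorySegment widenI32ToI64(Arena arena, MemorySegment src, long n) {
        MemorySegment out = arena.allocate(n * Long.BYTES, Long.BYTES);
        for (long i = 0; i < n; i++) {
            out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_INT, i)); // int widens with sign extension
        }
        return out;
    }
}
```

The caller would pass ctx.arena() from its decode() method, so the output shares the scan-chunk lifetime of every other decoded buffer.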

Hot-loop rule — no modulo/division/variable-target branch per element. A single i % cap per row blocks JIT auto-vectorization (C2 superword refuses Op_ModL/Op_DivL; no SIMD integer-divide opcode) — and loop-invariant cap doesn't help (strength-reduction needs a compile-time constant divisor). Scalar modulo is also 20–40 cycles vs ~1 for a load on Apple silicon. One modulo in a 1M-row body has caused 5–10× regressions here (ed658b7→051a794→442021f). Same for bounds/validity-bit checks and sign-extension switches — anything making the body non-uniform. For broadcast/clamp/mask, branch-split: hoist the check once, gate two specialized loop bodies.

long cap = SegmentBroadcast.capacity(src, 8);
if (cap == n) {                                  // fast path: zero modulos, vectorizes
    for (long i = 0; i < n; i++) { out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i)); }
} else {                                          // slow path: only ConstantEncoding broadcast
    for (long i = 0; i < n; i++) { out.setAtIndex(LE_LONG, i, src.getAtIndex(LE_LONG, i % cap)); }
}

Profile with the JMH stack profiler (-prof stack:lines=10); when idiv/sdiv or arithmetic helpers show up as the hot frame, a per-element modulo or division is almost always the cause.

Reference implementation

When stuck on encode/decode behavior, read the Rust reference at https://github.com/spiraldb/vortex (via gh api repos/spiraldb/vortex/contents/<path>): encodings/fastlanes/src/{bitpacking,for}/, encodings/sparse/src/, encodings/alp/src/alp/, and https://github.com/spiraldb/fastlanes-rs (src/bitpacking.rs, src/macros.rs).

Never reverse-engineer wire formats by probing bytes. Read the vtable serialize/deserialize in the Rust source for the exact schema, then implement from spec.

Design decisions

DType is pluggable only via Extension. DType is a sealed interface; downstream code must not add variants. Use new DType.Extension("ip.address", new DType.Primitive(PType.I32, false), null, false) and register decoders/encoders on the registries (or ServiceLoader<ExtensionEncoder>). Mirrors Rust (vortex.date, vortex.uuid, …). No SPI for DType variants planned.
Layout is a fixed set, no SPI. ScanIterator.decodeLayout() dispatches the known IDs (flat/chunked/zoned/struct/dict) and throws otherwise. Keep the fixed set; revisit only for a concrete downstream case unaddressable by a different flat-segment encoding.
Small public APIs. Don't expose internals — when in doubt, leave it out or make it private.
POM deps grouped under section comments, each group with project-internal (io.github.dfa1.vortex:*) deps first, then external. Omit empty sections.
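
The reason for keeping DType sealed can be shown with a cut-down model. SketchDType is hypothetical and has far fewer variants and fields than the real interface; the point is that a pattern switch over a sealed hierarchy is checked for exhaustiveness at compile time, while new logical types enter through the Extension variant.

```java
/** Hypothetical, cut-down model of a sealed DType; not the project's interface. */
sealed interface SketchDType permits SketchDType.Primitive, SketchDType.Extension {

    record Primitive(String ptype, boolean nullable) implements SketchDType {}

    /** A new logical type is an ID plus an existing storage type, not a new variant. */
    record Extension(String id, SketchDType storage, boolean nullable) implements SketchDType {}

    /** No default branch: adding a variant without a case here is a compile error. */
    static String describe(SketchDType type) {
        return switch (type) {
            case Primitive p -> p.ptype() + (p.nullable() ? "?" : "");
            case Extension e -> e.id() + "<" + describe(e.storage()) + ">";
        };
    }
}
```

For example, an "ip.address" extension stored as a non-nullable i32 is described as ip.address<i32>, and no code that switches over the variants had to change.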

Code style

4-space indent, zero SonarQube bugs/smells, no sun.misc.Unsafe or internal JDK APIs.
Prefer explicit over clever; fail fast on unhandled cases.
Idiomatic modern Java: reuse the JDK (override Iterator.forEachRemaining, don't invent forEachChunk; use Optional, records, sealed types, pattern switches, virtual threads, FFM). New APIs should feel like JDK APIs.
Always braces for if/else/for/while, even one-liners (if (c) { return a; }).
Time quantities use java.time.Duration, never long (no long timeoutMs/delayNanos). Exception: low-level JDK interop taking long ns (Thread.sleep, LockSupport.parkNanos, System.nanoTime math) — convert at the call site via duration.toNanos()/toMillis().
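
The Duration rule in the last item can be sketched as follows. BackoffSketch and its methods are hypothetical; the only long in sight is the one handed to the JDK at the call site.

```java
import java.time.Duration;
import java.util.concurrent.locks.LockSupport;

final class BackoffSketch {

    /** Doubles the delay, capped at max; time quantities stay Duration throughout. */
    static Duration nextDelay(Duration current, Duration max) {
        Duration doubled = current.multipliedBy(2);
        return doubled.compareTo(max) > 0 ? max : doubled;
    }

    /** Low-level JDK interop: convert to nanoseconds only where the JDK API demands a long. */
    static void pause(Duration delay) {
        LockSupport.parkNanos(delay.toNanos());
    }
}
```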

Javadoc (build-enforced: `failOnError` + `failOnWarnings`)

Every public method: main prose description, @param per parameter, @return (unless void). Every public record: @param per component on the class doc. @see-only counts as no description.
All /// Markdown — no HTML (checkstyle RegexpSingleline blocks <p>,<ul>,<li>, <strong>,<pre>,<table>, …). Use blank /// for paragraphs, - lists, ```java ```, **bold**. Cross-refs [ClassName#method(ParamType)] — verify the target exists (wrong refs are errors).
Check: ./mvnw javadoc:javadoc -pl core must produce zero output.
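
A doc comment that follows these rules might look like the sketch below. RangeSketch is a hypothetical utility written only to show the comment shape: prose first, blank /// lines between paragraphs, a - list, a fenced example, then @param and @return.

```java
import java.util.Objects;

/// Hypothetical utility used only to illustrate the Javadoc conventions.
final class RangeSketch {

    private RangeSketch() {
    }

    /// Checks that the range `[offset, offset + length)` fits within `size` bytes.
    ///
    /// Fails fast instead of clamping, delegating to
    /// [Objects#checkFromIndexSize(long, long, long)]:
    ///
    /// - a negative `offset` or `length` is rejected
    /// - a range that ends past `size` is rejected
    ///
    /// ```java
    /// long end = RangeSketch.checkedEnd(4, 8, 16); // 12
    /// ```
    ///
    /// @param offset start of the range, in bytes
    /// @param length number of bytes in the range
    /// @param size total number of bytes available
    /// @return the exclusive end offset, `offset + length`
    public static long checkedEnd(long offset, long length, long size) {
        Objects.checkFromIndexSize(offset, length, size);
        return offset + length;
    }
}
```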

Encoding class structure

Encodings with non-trivial encode and decode separate them into private static inner classes Encoder and Decoder (shared low-level helpers live with their owner or a third inner class):

public final class FooEncoding implements Encoding {
    @Override public EncodeResult encode(DType dtype, Object data) { return Encoder.encode(dtype, data); }
    @Override public Array decode(DecodeContext ctx) { return Decoder.decode(ctx); }
    private static final class Encoder { static EncodeResult encode(DType dtype, Object data) { ... } }
    private static final class Decoder { static Array decode(DecodeContext ctx) { ... } }
}

Simple encodings (≤ ~80 lines, e.g. NullEncoding, BoolEncoding) are exempt.

Metadata-only encodings (all data in proto3 metadata, no buffers/children, e.g. SequenceEncoding): EncodeResult uses an EncodeNode with metadata set and empty bufferIndices; the decoder reads ctx.metadata() (not ctx.buffer(n)):

EncodeNode node = new EncodeNode(encodingId, ByteBuffer.wrap(meta.encode()), new EncodeNode[0], new int[]{});
// decode:
MemorySegment metaSeg = MemorySegment.ofBuffer(ctx.metadata().duplicate());
FooMetadata meta = FooMetadata.decode(metaSeg, 0, metaSeg.byteSize());

Generated proto records live in io.github.dfa1.vortex.proto; the runtime (ProtoReader, ProtoWriter) is package-private. For oneof messages (e.g. ScalarValue) prefer the static ofXxxValue(v) factory over the multi-arg constructor.

Testing

Cover happy path, negative cases (invalid input / errors), and corners (empty, zero, max, boundaries). Unit tests must be fast — no file I/O, network, or sleep; mock or use in-memory data.
Integration tests are ground truth (no formal spec): interop with the Rust reference. Write one for every encoding round-trip and file-format boundary.
JUnit 5 + Mockito (BDDMockito) + AssertJ. Class under test named sut. Every test has // Given / // When / // Then. BDDMockito only: given(mock.m()).willReturn(v) / then(...) (static-import only given/then, never willReturn/willThrow).
Prefer @ParameterizedTest over copy-paste (@ValueSource, else @ArgumentsSource/named cases). For large input spaces use seeded-random @MethodSource generators — they find corners examples miss. Put generators in RandomArrays (integration) or a similar util; keep counts low (10–30) when the test does file I/O or JNI.
@Nested groups related scenarios (@BeforeEach in a nested class applies only to it). Private helpers go after all @Test methods.
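
A seeded-random generator of the kind described above could look like this. RandomArraysSketch is a hypothetical stand-in for RandomArrays, and the choice of corner cases and the 1 to 64 length range are assumptions.

```java
import java.util.Random;
import java.util.stream.Stream;

/** Hypothetical generator in the spirit of RandomArrays; not the project's class. */
final class RandomArraysSketch {

    /** Fixed corner cases followed by count random arrays, reproducible for a given seed. */
    static Stream<long[]> longArrays(long seed, int count) {
        Random rnd = new Random(seed);
        Stream<long[]> corners = Stream.of(
                new long[0],
                new long[] {0L},
                new long[] {Long.MIN_VALUE, Long.MAX_VALUE});
        // Each array has a random length in [1, 64] and is filled with random longs.
        Stream<long[]> random = Stream
                .generate(() -> rnd.longs(1 + rnd.nextInt(64)).toArray())
                .limit(count);
        return Stream.concat(corners, random);
    }
}
```

A JUnit 5 @MethodSource factory must be a static no-argument method, so a test class would wrap this with a fixed seed and a low count. Keeping the seed constant makes any failure reproducible.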
