feat(aws_s3 sink): Add Apache Parquet encoder support #24372
base: master
Conversation
All contributors have signed the CLA ✍️ ✅

I have read the CLA Document and I hereby sign the CLA
Force-pushed from 7c16cdd to 672999d ("…ut defaulted to off")
thomasqueirozb left a comment
Hey @rorylshanks, thanks for your contribution! It looks like there are failing checks (run `make check-clippy`, for example). This is also failing to compile after merging `master`, because you removed `BatchSerializerConfig::build`, which is used by the `clickhouse` sink. I'll circle back to this PR and give it a review once I see commits pushed to this branch.
Hey Thomas! Thanks for the review so far. I have made some changes to fix the check issues, and also to change how the docs are templated, as I think this codec only makes sense for AWS S3 and other S3-compatible APIs, and I don't really have time to test that the parquet implementation works for every sink. So I limited the docs to only include the parquet format for S3 and similar sinks, and not for others. But please correct me if this is the wrong approach!
Hi @rorylshanks, is buffer support added for back-pressure handling? Nice work taking this up.
```ruby
    return sort_hash_nested(unwrapped_resolved_schema)
  end

  PARQUET_ALLOWED_SINKS = %w[aws_s3 gcp_cloud_storage azure_blob].freeze
```
Is there any particular reason for this? I assume it'd be possible to use the parquet codec with the `file` sink also, no? In general we don't "restrict" codecs like this; sometimes they are incredibly useful for debugging. (One recent example that comes to mind is the `otlp` codec, which is not useful at all outside of a few select sinks, notably the `opentelemetry` sink, but we still support it for all sinks.)
Suggested change:

```diff
-1. **Explicit Schema**: Define the exact structure and data types for your Parquet files
-2. **Automatic Schema Inference**: Let Vector automatically infer the schema from your event data
+1. **Explicit Schema**: Define the exact structure and data types for your Parquet files.
+2. **Automatic Schema Inference**: Let Vector automatically infer the schema from your event data.
```
```markdown
### allow_nullable_fields

When enabled, missing or incompatible values will be encoded as NULL even for fields that
would normally be non-nullable. This is useful when working with downstream systems that
can handle NULL values through defaults or computed columns.
```

This can be removed, as we already have the automatically generated doc for this option.
```markdown
**Per-column Bloom filter settings:**

- **bloom_filter**: Enable Bloom filter for this column (default: `false`)
- **bloom_filter_num_distinct_values**: Expected number of distinct values for this column's Bloom filter
  - Low cardinality (countries, states): `1,000` - `100,000`
  - Medium cardinality (cities, products): `100,000` - `1,000,000`
  - High cardinality (user IDs, UUIDs): `10,000,000+`
  - If not specified, defaults to `1,000,000`
  - Automatically capped to the `row_group_size` value
- **bloom_filter_false_positive_pct**: False positive probability for this column's Bloom filter
  - `0.05` (5%): Good balance for general use
  - `0.01` (1%): Better for high-selectivity queries where precision matters
  - `0.10` (10%): Smaller filters when storage is a concern
  - If not specified, defaults to `0.05`

A false positive means the Bloom filter indicates a value *might* be in a row group when it
actually isn't, requiring the engine to read and filter that row group. Lower FPP means fewer
unnecessary reads but larger Bloom filters.
```
This info should be present in the automatically generated documentation for these fields instead
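For illustration only, here is a rough sketch of how these per-column settings might appear in a sink config. The field names come from the quoted documentation above, but the exact nesting under `schema` is my assumption, not taken from the PR:

```yaml
encoding:
  codec: parquet
  parquet:
    schema:
      user_id:
        type: utf8
        bloom_filter: true                           # enable for a high-cardinality column
        bloom_filter_num_distinct_values: 10000000   # expected distinct values (per quoted docs)
        bloom_filter_false_positive_pct: 0.01        # tighter filter for selective queries
```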
```yaml
encoding:
  codec: parquet
  parquet:
    schema:
```
Is there a `.parquet` format (or something similar) so that the user can point to a file instead of defining this inside the configuration YAML? In these scenarios we usually opt for a config file instead. See `parse_proto` and `validate_json_schema`.
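To make the idea concrete, a purely hypothetical file-based variant might look like this; `schema_file` is an invented option name, not something the PR implements:

```yaml
encoding:
  codec: parquet
  parquet:
    # Hypothetical: load the schema from a file instead of defining it inline.
    schema_file: /etc/vector/parquet_schema.json
```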
```markdown
#### max_columns

Maximum number of columns to encode when using automatic schema inference. Additional
columns beyond this limit will be silently dropped. Columns are selected in the order
they appear in the first event.

This protects against accidentally creating Parquet files with too many columns, which
can cause performance issues in query engines.

**Only applies when `infer_schema` is enabled**. Ignored when using explicit schema.

**Default**: `1000`

**Recommended values:**
- Standard use cases: `1000` (default)
- Wide tables: `500` - `1000`
- Performance-critical: `100` - `500`
```
Same comment
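As a quick illustration, a sketch of how `max_columns` might sit alongside `infer_schema` in the encoding block; the nesting is assumed from the YAML excerpt quoted earlier:

```yaml
encoding:
  codec: parquet
  parquet:
    infer_schema: true
    max_columns: 500   # assumed placement; per the quoted docs, ignored with an explicit schema
```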
In general we should avoid including parameter descriptions and examples in this file. These should live in the source code instead, so that the documentation is easier to maintain and we avoid drift.
```rust
        .to_compression(config.compression_level)
        .map_err(vector_common::Error::from)?;

    tracing::debug!(
```
Suggested change:

```diff
-    tracing::debug!(
+    debug!(
```
Nit: we usually avoid using the `tracing::` prefix and opt to import this at the start of the file instead.
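A minimal sketch of that style, assuming nothing else about the surrounding code:

```rust
use tracing::debug;

fn log_encoded_batch(bytes_written: usize) {
    // Same macro as before, just without the `tracing::` path prefix.
    debug!(message = "Encoded Parquet batch.", bytes_written);
}
```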
```rust
let fpp = bloom_config.fpp.unwrap_or(0.05); // Default 5% false positive rate
let mut ndv = bloom_config.ndv.unwrap_or(1_000_000); // Default 1M distinct values
```
Please use a static constant instead. Magic values tend to fall out of date with docs
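A minimal sketch of that suggestion applied to the two lines quoted above; the constant names (and their types) are mine, not from the PR:

```rust
/// Default Bloom filter false positive probability (5%).
const DEFAULT_BLOOM_FILTER_FPP: f64 = 0.05;
/// Default expected number of distinct values per column Bloom filter.
const DEFAULT_BLOOM_FILTER_NDV: u64 = 1_000_000;

// ...the defaults then live in one place and cannot silently drift from the docs:
let fpp = bloom_config.fpp.unwrap_or(DEFAULT_BLOOM_FILTER_FPP);
let mut ndv = bloom_config.ndv.unwrap_or(DEFAULT_BLOOM_FILTER_NDV);
```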
````rust
/// **Example:**
/// ```yaml
/// sorting_columns:
///   - column: timestamp
///     descending: true
///   - column: user_id
///     descending: false
/// ```
///
/// If not specified, rows are written in the order they appear in the batch.
````
I think that using `#[configurable(metadata(docs::examples = "foo bar"))]` works here. I'm not sure how this would render with multi-line examples, but if the rendering doesn't work properly I'll make sure to fix it in a separate PR.
```rust
/// Mutually exclusive with `infer_schema`. Must specify either `schema` or `infer_schema: true`.
///
/// Supported types: utf8, int8, int16, int32, int64, uint8, uint16, uint32, uint64,
/// float32, float64, boolean, binary, timestamp_second, timestamp_millisecond,
/// timestamp_microsecond, timestamp_nanosecond, date32, date64, and more.
#[serde(default)]
#[configurable(metadata(docs::examples = "schema_example()"))]
pub schema: Option<SchemaDefinition>,

/// Automatically infer schema from event data
///
/// When enabled, the schema is inferred from each batch of events independently.
/// The schema is determined by examining the types of values in the events.
///
/// **Type mapping:**
/// - String values → `utf8`
/// - Integer values → `int64`
/// - Float values → `float64`
/// - Boolean values → `boolean`
/// - Timestamp values → `timestamp_microsecond`
/// - Arrays/Objects → `utf8` (serialized as JSON)
///
/// **Type conflicts:** If a field has different types across events in the same batch,
/// it will be encoded as `utf8` (string) and all values will be converted to strings.
///
/// **Important:** Schema consistency across batches is the operator's responsibility.
/// Use VRL transforms to ensure consistent types if needed. Each batch may produce
/// a different schema if event structure varies.
///
/// **Bloom filters:** Not supported with inferred schemas. Use explicit schema for Bloom filters.
///
/// Mutually exclusive with `schema`. Must specify either `schema` or `infer_schema: true`.
#[serde(default)]
#[configurable(metadata(docs::examples = true))]
pub infer_schema: bool,
```
This can be expressed with an enum, which will simplify a lot of the checks. It will also render the docs in a way that makes it clear which options are available with a schema and which are available without.

Something like:

```rust
enum Schema {
    Inferred { infer_schema: bool, exclude_columns: Option<Vec<String>> },
    Schema { schema: SchemaDefinition },
}

/// Configuration for Parquet serialization
#[configurable_component]
pub struct ParquetSerializerConfig {
    #[serde(flatten)]
    schema: Schema,
}
```

This is very rough and leaves out all the doc comments, but it should work. I'm sure you can find examples of how to do this properly in the Vector code base.
I wouldn't expect anyone to test every sink! I think we can do just a quick sanity check and write to a file using the file sink, or even a socket using the socket sink, and verify that it is working correctly. We can then add a note to the codec docs specifying which sinks it should be used with. As I mentioned before, we usually show all codecs, even if it is very unusual for anyone to use some of them.
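For that kind of sanity check, a minimal config sketch, assuming the codec ends up enabled for the `file` sink and using the `infer_schema` option discussed above:

```yaml
sources:
  demo:
    type: demo_logs          # any source works; demo_logs keeps the check self-contained
    format: json

sinks:
  parquet_smoke_test:
    type: file
    inputs: ["demo"]
    path: /tmp/vector-smoke-test.parquet
    encoding:
      codec: parquet
      parquet:
        infer_schema: true
```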
Don't worry about this for now. Worst case scenario, I can fix this myself before merging :)
Summary
This PR adds Apache Parquet encoding support to the AWS S3 sink, enabling Vector to write columnar Parquet files optimized for analytics workloads.
Parquet is a columnar storage format that provides efficient compression and encoding, making it ideal for long-term storage and query performance with tools like AWS Athena, Apache Spark, and Presto. This implementation allows users to write properly formatted Parquet files with configurable schemas, compression, and row group sizing.
Key features:
Vector configuration
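The PR's own configuration example is collapsed in this view. As a rough sketch only, a config combining the options discussed in this thread might look like the following; the bucket, region, input name, and the exact shape of `schema` are assumptions on my part, not copied from the PR:

```yaml
sinks:
  s3_parquet:
    type: aws_s3
    inputs: ["kafka_in"]            # placeholder input
    bucket: my-analytics-bucket     # placeholder
    region: us-east-1               # placeholder
    key_prefix: "logs/date=%Y-%m-%d/"
    encoding:
      codec: parquet
      parquet:
        schema:                     # assumed shape of an explicit schema definition
          timestamp:
            type: timestamp_microsecond
          message:
            type: utf8
```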
How did you test this PR?
I tested it against production Kafka data, and it produced correctly formatted Parquet files in S3.
Change Type
Is this a breaking change?
Does this PR include user facing changes?
If not, add the `no-changelog` label to this PR.

References
- `parquet` columnar format in the `aws_s3` sink #1374
- `parquet` codec #17395