
Conversation


@rorylshanks rorylshanks commented Dec 12, 2025

Summary

This PR adds Apache Parquet encoding support to the AWS S3 sink, enabling Vector to write columnar Parquet files optimized for analytics workloads.

Parquet is a columnar storage format that provides efficient compression and encoding, making it ideal for long-term storage and query performance with tools like AWS Athena, Apache Spark, and Presto. This implementation allows users to write properly formatted Parquet files with configurable schemas, compression, and row group sizing.

Key features:

  • Complete Parquet encoder implementation with Apache Arrow integration
  • YAML schema configuration support (field names → data types)
  • Support for all common data types (strings, integers, floats, timestamps, booleans, etc.)
  • Configurable compression algorithms (snappy, gzip, zstd, lz4, brotli)
  • Row group size control for query parallelization
  • Nullable field support
  • Comprehensive test suite (9 unit tests)
  • Full documentation for schema configuration and Parquet options

Vector configuration

sources:
  events:
    type: kafka
    bootstrap_servers: "kafka:9092"
    topics:
      - events

transforms:
  prepare:
    inputs:
      - events
    type: remap
    source: |
      parsed = parse_json(.message) ?? {}
      .uuid = parsed.uuid
      .properties = parsed.properties
      

sinks:
  s3_events:
    type: aws_s3
    inputs:
      - prepare
    bucket: my-bucket
    region: us-east-1
    compression: none  # Parquet handles compression internally

    batch:
      max_events: 50000
      timeout_secs: 60

    encoding:
      codec: parquet
      parquet:
        compression: zstd
        allow_nullable_fields: true
        schema:
          timestamp: timestamp_microsecond
          uuid: utf8
          properties: utf8

How did you test this PR?

I tested it against production Kafka data, and it produced correctly formatted Parquet files in S3.

Change Type

  • Bug fix
  • New feature (Parquet encoder for AWS S3 sink)
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the no-changelog label to this PR.

References

@rorylshanks rorylshanks requested review from a team as code owners December 12, 2025 13:07
@github-actions github-actions bot added the domain: sinks, domain: codecs, and domain: external docs labels Dec 12, 2025
@rorylshanks rorylshanks changed the title from "Added parquet encoding to Vector AWS S3 Output" to "feat(aws_s3 sink): Add Apache Parquet encoder support" Dec 12, 2025

github-actions bot commented Dec 12, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@rorylshanks rorylshanks (Author) commented

I have read the CLA Document and I hereby sign the CLA

@drichards-87 drichards-87 self-assigned this Dec 12, 2025
@drichards-87 drichards-87 removed their assignment Dec 12, 2025
@rorylshanks rorylshanks marked this pull request as draft December 14, 2025 15:34
@rorylshanks rorylshanks marked this pull request as ready for review December 16, 2025 06:34
@github-actions github-actions bot added the domain: ci label Dec 22, 2025
@thomasqueirozb thomasqueirozb (Contributor) left a comment


Hey @rorylshanks, thanks for your contribution! It looks like there are failing checks (run `make check-clippy`, for example). This branch is also failing to compile after merging master, because you removed `BatchSerializerConfig::build`, which is used by the clickhouse sink. I'll circle back to this PR and give it a review once I see commits pushed to this branch.

@thomasqueirozb thomasqueirozb added the meta: awaiting author label Dec 22, 2025
Co-authored-by: Thomas <[email protected]>
@github-actions github-actions bot removed the meta: awaiting author label Dec 23, 2025
@thomasqueirozb thomasqueirozb added the meta: awaiting author label Dec 23, 2025
@rorylshanks rorylshanks requested a review from a team as a code owner December 24, 2025 12:02
@github-actions github-actions bot removed the meta: awaiting author label Dec 24, 2025
@rorylshanks rorylshanks (Author) commented Dec 27, 2025

Hey Thomas! Thanks for the review so far. I have made some changes to fix the check issues, and also to change how the docs are templated, as I think this codec only makes sense for AWS S3 and other S3 APIs, and I don't really have time to test that the parquet implementation works for every sink. So I limited the docs to only include the parquet format for S3 et al, and not for others. But please correct me if this is the wrong approach!
I am currently having issues with the cue docs checker, as I have changed how the templating works for the docs, the check is not happy, and I can't figure out how to make it happy. Do you have any ideas here?


abhishek-dream11 commented Jan 7, 2026

Hi @rorylshanks, is buffer support added for backpressure handling? Nice of you to take this up.

return sort_hash_nested(unwrapped_resolved_schema)
end

PARQUET_ALLOWED_SINKS = %w[aws_s3 gcp_cloud_storage azure_blob].freeze
Contributor

Is there any particular reason for this? I assume it'd be possible to use the parquet codec with the file sink as well, no? In general we don't "restrict" codecs like this; sometimes they are incredibly useful for debugging. (One recent example that comes to mind is the otlp codec, which is not useful at all outside of a few select sinks, notably the opentelemetry sink, but we still support it for all sinks.)

Comment on lines +118 to +119
1. **Explicit Schema**: Define the exact structure and data types for your Parquet files
2. **Automatic Schema Inference**: Let Vector automatically infer the schema from your event data
Contributor

Suggested change
- 1. **Explicit Schema**: Define the exact structure and data types for your Parquet files
- 2. **Automatic Schema Inference**: Let Vector automatically infer the schema from your event data
+ 1. **Explicit Schema**: Define the exact structure and data types for your Parquet files.
+ 2. **Automatic Schema Inference**: Let Vector automatically infer the schema from your event data.

Comment on lines +446 to +450
### allow_nullable_fields

When enabled, missing or incompatible values will be encoded as NULL even for fields that
would normally be non-nullable. This is useful when working with downstream systems that
can handle NULL values through defaults or computed columns.
Contributor

Suggested change
- ### allow_nullable_fields
- When enabled, missing or incompatible values will be encoded as NULL even for fields that
- would normally be non-nullable. This is useful when working with downstream systems that
- can handle NULL values through defaults or computed columns.

This can be removed as we already have the automatically generated doc for this option

Comment on lines +496 to +513
**Per-column Bloom filter settings:**
- **bloom_filter**: Enable Bloom filter for this column (default: `false`)
- **bloom_filter_num_distinct_values**: Expected number of distinct values for this column's Bloom filter
  - Low cardinality (countries, states): `1,000` - `100,000`
  - Medium cardinality (cities, products): `100,000` - `1,000,000`
  - High cardinality (user IDs, UUIDs): `10,000,000+`
  - If not specified, defaults to `1,000,000`
  - Automatically capped to the `row_group_size` value
- **bloom_filter_false_positive_pct**: False positive probability for this column's Bloom filter
  - `0.05` (5%): Good balance for general use
  - `0.01` (1%): Better for high-selectivity queries where precision matters
  - `0.10` (10%): Smaller filters when storage is a concern
  - If not specified, defaults to `0.05`

A false positive means the Bloom filter indicates a value *might* be in a row group when it
actually isn't, requiring the engine to read and filter that row group. Lower FPP means fewer
unnecessary reads but larger Bloom filters.
Contributor

This info should be present in the automatically generated documentation for these fields instead

encoding:
  codec: parquet
  parquet:
    schema:
Contributor

Is there a `.parquet` format (or something similar) so that the user can point to a file instead of defining this inside the configuration YAML? In these scenarios we usually opt for a config file instead. See `parse_proto` and `validate_json_schema`.
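
For illustration, a hypothetical sketch of the file-based variant being described; the `schema_file` field and the loader below are placeholder names, and neither exists in the PR or in Vector today:

use std::path::PathBuf;

pub struct ParquetSchemaFileSketch {
    /// Path to a YAML file describing the Parquet schema (field name -> type).
    pub schema_file: Option<PathBuf>,
}

impl ParquetSchemaFileSketch {
    /// Reads the schema file as a string; a real implementation would parse it
    /// into the codec's `SchemaDefinition` type.
    pub fn load(&self) -> std::io::Result<Option<String>> {
        self.schema_file
            .as_ref()
            .map(std::fs::read_to_string)
            .transpose()
    }
}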

Comment on lines +359 to +375
#### max_columns

Maximum number of columns to encode when using automatic schema inference. Additional
columns beyond this limit will be silently dropped. Columns are selected in the order
they appear in the first event.

This protects against accidentally creating Parquet files with too many columns, which
can cause performance issues in query engines.

**Only applies when `infer_schema` is enabled**. Ignored when using explicit schema.

**Default**: `1000`

**Recommended values:**
- Standard use cases: `1000` (default)
- Wide tables: `500` - `1000`
- Performance-critical: `100` - `500`
Contributor

Same comment

Contributor

In general we should avoid including parameter descriptions and examples in this file. These should live in the source code instead, so that the documentation is easier to maintain and doesn't drift.
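
As a hedged illustration of that pattern (the struct, field, and default below are stand-ins, not the PR's actual code), option documentation in Vector lives as Rust doc comments on `#[configurable_component]` fields, and the site docs are generated from them:

use vector_config::configurable_component;

/// Parquet writer options (sketch for illustration only).
#[configurable_component]
#[derive(Clone, Debug)]
pub struct ParquetOptionsSketch {
    /// Maximum number of rows per Parquet row group.
    ///
    /// Larger row groups improve compression and scan efficiency; smaller
    /// row groups give query engines more parallelism.
    #[serde(default = "default_row_group_size")]
    pub row_group_size: usize,
}

/// Illustrative default; the value used by the PR may differ.
fn default_row_group_size() -> usize {
    100_000
}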

.to_compression(config.compression_level)
.map_err(vector_common::Error::from)?;

tracing::debug!(
Contributor

Suggested change
- tracing::debug!(
+ debug!(

Nit: we usually avoid using tracing:: and opt to import this at the start of the file instead.
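
A minimal sketch of the suggested style (the function and field names here are illustrative, not from the PR): import the macro once at the top of the module and call it unqualified.

use tracing::debug;

fn log_row_group_written(row_count: u64) {
    // Same structured-logging call, without the `tracing::` prefix.
    debug!(message = "Wrote Parquet row group.", rows = row_count);
}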

Comment on lines +496 to +497
let fpp = bloom_config.fpp.unwrap_or(0.05); // Default 5% false positive rate
let mut ndv = bloom_config.ndv.unwrap_or(1_000_000); // Default 1M distinct values
Contributor

Please use a static constant instead. Magic values tend to fall out of date with docs
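
A minimal sketch of that change, mirroring the two quoted lines above; the constant names are illustrative, not taken from the PR:

/// Default false-positive rate for per-column Bloom filters (5%).
const DEFAULT_BLOOM_FILTER_FPP: f64 = 0.05;
/// Default expected number of distinct values per column (one million).
const DEFAULT_BLOOM_FILTER_NDV: u64 = 1_000_000;

// The quoted lines then become:
let fpp = bloom_config.fpp.unwrap_or(DEFAULT_BLOOM_FILTER_FPP);
let mut ndv = bloom_config.ndv.unwrap_or(DEFAULT_BLOOM_FILTER_NDV);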

Comment on lines +269 to +278
/// **Example:**
/// ```yaml
/// sorting_columns:
///   - column: timestamp
///     descending: true
///   - column: user_id
///     descending: false
/// ```
///
/// If not specified, rows are written in the order they appear in the batch.
Contributor

I think that using `#[configurable(metadata(docs::examples = "foo bar"))]` works here. I'm not sure how this would render with multi-line examples, but if the rendering doesn't work properly I'll make sure to fix it in a separate PR.
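
A hedged sketch of what that might look like on the field from the quoted doc comment; the element type `SortingColumnConfig` and the single-line example string are assumptions, and how a multi-line example renders is exactly the open question above:

#[serde(default)]
#[configurable(metadata(
    docs::examples = "- column: timestamp\n  descending: true\n- column: user_id\n  descending: false"
))]
pub sorting_columns: Option<Vec<SortingColumnConfig>>,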

Comment on lines +106 to +140
/// Mutually exclusive with `infer_schema`. Must specify either `schema` or `infer_schema: true`.
///
/// Supported types: utf8, int8, int16, int32, int64, uint8, uint16, uint32, uint64,
/// float32, float64, boolean, binary, timestamp_second, timestamp_millisecond,
/// timestamp_microsecond, timestamp_nanosecond, date32, date64, and more.
#[serde(default)]
#[configurable(metadata(docs::examples = "schema_example()"))]
pub schema: Option<SchemaDefinition>,

/// Automatically infer schema from event data
///
/// When enabled, the schema is inferred from each batch of events independently.
/// The schema is determined by examining the types of values in the events.
///
/// **Type mapping:**
/// - String values → `utf8`
/// - Integer values → `int64`
/// - Float values → `float64`
/// - Boolean values → `boolean`
/// - Timestamp values → `timestamp_microsecond`
/// - Arrays/Objects → `utf8` (serialized as JSON)
///
/// **Type conflicts:** If a field has different types across events in the same batch,
/// it will be encoded as `utf8` (string) and all values will be converted to strings.
///
/// **Important:** Schema consistency across batches is the operator's responsibility.
/// Use VRL transforms to ensure consistent types if needed. Each batch may produce
/// a different schema if event structure varies.
///
/// **Bloom filters:** Not supported with inferred schemas. Use explicit schema for Bloom filters.
///
/// Mutually exclusive with `schema`. Must specify either `schema` or `infer_schema: true`.
#[serde(default)]
#[configurable(metadata(docs::examples = true))]
pub infer_schema: bool,
Contributor

This can be expressed with an enum, which will simplify a lot of the checks. It will also render the docs in a way that makes it clear which options are available with an explicit schema and which are available without one.

Something like

enum Schema {
  Inferred { infer_schema: bool, exclude_columns: Option<Vec<String>> },
  Schema { schema: SchemaDefinition },
}


/// Configuration for Parquet serialization
#[configurable_component]
pub struct ParquetSerializerConfig {
  #[serde(flatten)]
  schema: Schema,
}

This is very rough and strips all the doc comments, but it should work. I'm sure you can find examples of how to do this properly in the Vector code base.

@thomasqueirozb thomasqueirozb added the meta: awaiting author label Jan 9, 2026
@thomasqueirozb thomasqueirozb (Contributor) commented

> I have made some changes to fix the check issues, and also to change how the docs are templated, as I think this codec only makes sense for AWS S3 and other S3 APIs, and I don't really have time to test that the parquet implementation works for every sink.

I wouldn't expect anyone to test every sink! I think we can do just a quick sanity check: write to a file using the file sink, or even to a socket using the socket sink, and verify that it is working correctly. We can then add a note to the codec docs specifying which sinks it should be used with. As I mentioned before, we usually show all codecs, even if it is very unusual for anyone to use some of them.

> I am currently having issues with the cue docs checker, as I have changed how the templating works for the docs, the check is not happy, and I can't figure out how to make it happy. Do you have any ideas here?

Don't worry about this for now. Worst-case scenario, I can fix this myself before merging :)


Labels

domain: ci, domain: codecs, domain: external docs, domain: sinks, meta: awaiting author


Development

Successfully merging this pull request may close these issues.

Support parquet columnar format in the aws_s3 sink

4 participants