A database fuzzer (structured + black-box) for Apache Iceberg, and other file-format readers
FuzzBerg was built to secure the launch of Firebolt Core and READ_ICEBERG, and helped us overcome the challenges of fuzzing complex database interfaces, such as Table-Valued Functions (TVFs) and COPY_FROM.
It quickly proved its worth by discovering 5 critical bugs across all our TVF formats, including READ_ICEBERG.
As a structured fuzzer, it relies on a valid seed corpus. If a crash is detected in the target, it writes the crash output to a user-provided path and exits.
For more details about the internals, you can read the official blog.
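As a rough illustration of the crash handling described above, the C++ sketch below checks whether a child process was terminated by a signal and, if so, saves the offending input to an output directory. It is not FuzzBerg's code: the function names (`CrashSignal`, `SaveCrash`), the file-naming scheme, and the blocking/non-blocking switch are all assumptions made for this example.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <csignal>
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical helper: returns the signal that terminated `pid`, or 0 if the
// child exited normally or (in non-blocking mode) is still running.
int CrashSignal(pid_t pid, bool block) {
  assert(pid > 0);
  int status = 0;
  const pid_t reaped = waitpid(pid, &status, block ? 0 : WNOHANG);
  if (reaped == pid && WIFSIGNALED(status)) {
    return WTERMSIG(status);
  }
  return 0;
}

// Hypothetical helper: writes the input that triggered the crash into
// `out_dir` and returns the path of the file it created.
std::filesystem::path SaveCrash(const std::filesystem::path& out_dir,
                                const std::string& payload, int sig) {
  std::filesystem::create_directories(out_dir);
  const std::filesystem::path path =
      out_dir / ("crash-sig" + std::to_string(sig));
  std::ofstream out(path, std::ios::binary);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  return path;
}
```

A fuzzing loop would more likely poll with `block = false` between queries, since a healthy database server does not exit on its own.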
- Fuzz data ingestion interfaces (e.g., COPY FROM, TVFs: read_iceberg(), read_csv(), read_parquet())
- No need to write/maintain unit-level harnesses
- Currently supported formats: Iceberg, CSV, Parquet
- Easily extensible for new targets and file-formats
Note: Iceberg fuzzing is currently supported for S3-based readers only. To fuzz on Linux platforms, use an S3-compatible interface such as MinIO.
Mutations are both structure-aware and randomised with libRadamsa (no coverage guidance), seeded by a Mersenne Twister PRNG.
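The split between structure-aware and randomised mutation can be pictured with the sketch below. It is not FuzzBerg's code: `ReplaceIntegerField` and `FlipRandomBit` are hypothetical helpers, the bit flip is only a stand-in for what libRadamsa does, and the naive string search assumes a flat JSON document in which the key name does not also appear inside a value. The one detail taken from the text is that randomness comes from a seeded Mersenne Twister (`std::mt19937_64` here).

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Structure-aware step (hypothetical): replace the integer value stored under
// `key` with `value`, leaving the rest of the JSON text untouched.
// Returns false if the key is missing or its value is not an integer.
bool ReplaceIntegerField(std::string& json, const std::string& key,
                         const std::string& value) {
  const std::string quoted = "\"" + key + "\"";
  std::size_t pos = json.find(quoted);
  if (pos == std::string::npos) return false;
  pos = json.find(':', pos + quoted.size());
  if (pos == std::string::npos) return false;
  const std::size_t begin = json.find_first_not_of(" \t\r\n", pos + 1);
  if (begin == std::string::npos) return false;
  std::size_t end = json.find_first_not_of("-0123456789", begin);
  if (end == std::string::npos) end = json.size();
  if (end == begin) return false;  // value does not start with a digit or '-'
  json.replace(begin, end - begin, value);
  return true;
}

// Randomised step (stand-in for libRadamsa): flip one random bit of the
// payload. The caller owns the PRNG, so the same seed gives the same mutation.
std::string FlipRandomBit(std::string data, std::mt19937_64& rng) {
  assert(!data.empty());
  std::uniform_int_distribution<std::size_t> pick_byte(0, data.size() - 1);
  std::uniform_int_distribution<int> pick_bit(0, 7);
  const std::size_t i = pick_byte(rng);
  const unsigned char mask = static_cast<unsigned char>(1u << pick_bit(rng));
  data[i] = static_cast<char>(static_cast<unsigned char>(data[i]) ^ mask);
  return data;
}
```

Calling `ReplaceIntegerField(doc, "current-snapshot-id", "170141183460469231731687303715884105727")` on a metadata file would produce the kind of out-of-range value shown in the structured metadata fuzzing output later in this README.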
- Place target code under src/Databases/<database>.{cpp,h}
- Add <database>.cpp to CMakeLists.txt
- Implement a target DB class, and override the following base interfaces:
  - DatabaseHandler::ForkTarget(): to launch target as a child of the fuzzer
  - DatabaseHandler::fuzz(): call the relevant file-format fuzzer
- Create a JSON file under queries/<database>/*.json listing relevant queries for your target.
  - Only add queries for file-formats currently supported by the fuzzer (CSV, Parquet, Iceberg).
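The two overrides named above might look roughly like the following. This is a sketch under stated assumptions, not the project's interface: the `DatabaseHandler` base class is re-declared here as a hypothetical stand-in, and the parameter lists, return types, `ExampleDbHandler`, and the `Fuzz*` placeholders are invented for illustration. Check the real header under src/Databases/ for the actual signatures.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the fuzzer's base class; real signatures may differ.
class DatabaseHandler {
 public:
  virtual ~DatabaseHandler() = default;
  virtual pid_t ForkTarget(const std::vector<std::string>& argv) = 0;
  virtual bool fuzz(const std::string& format) = 0;
};

class ExampleDbHandler : public DatabaseHandler {
 public:
  // Launches the target binary (argv[0]) as a child of the calling process.
  // Returns the child's pid, or -1 if argv is empty or fork() fails.
  pid_t ForkTarget(const std::vector<std::string>& argv) override {
    if (argv.empty()) return -1;
    // Build the C-style argument array before forking.
    std::vector<char*> args;
    for (const std::string& arg : argv) {
      args.push_back(const_cast<char*>(arg.c_str()));
    }
    args.push_back(nullptr);
    assert(args.back() == nullptr);  // execv needs a null-terminated array
    const pid_t pid = fork();
    if (pid == 0) {
      execv(args[0], args.data());
      _exit(127);  // only reached if execv failed
    }
    return pid;
  }

  // Dispatches to the fuzzer for the requested file format.
  // Returns false for formats this sketch does not know about.
  bool fuzz(const std::string& format) override {
    if (format == "csv") return FuzzCsv();
    if (format == "parquet") return FuzzParquet();
    if (format == "iceberg") return FuzzIceberg();
    return false;
  }

 private:
  // Placeholders: a real target would call the matching file-format fuzzer.
  bool FuzzCsv() { return true; }
  bool FuzzParquet() { return true; }
  bool FuzzIceberg() { return true; }
};
```

Keeping the launch logic in `ForkTarget()` means the fuzzer is the parent of the target and can observe its exit status directly.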
- Install libcurl4-openssl-dev (Ubuntu/Debian). See details.
- Build with CMake & Ninja:

    mkdir build && cd build
    cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_C_COMPILER=clang-18 -DCMAKE_CXX_COMPILER=clang++-18 -G Ninja ../
    ninja -j<N> fuzzberg

Note: For efficient fuzzing, compile your target with AddressSanitizer. Also, fuzzing a Release build is recommended (where invariants like DCHECK are compiled out).
FuzzBerg is released under the Apache License 2.0. See the LICENSE file for details.
Usage: ./fuzzberg [OPTIONS]

Required:
  -d, --database NAME       Database name (e.g., duckdb, firebolt)
  -f, --format FORMAT       File format (csv, parquet, iceberg)
  -u, --url URL             Database server URL
  -i, --input DIR           Input corpus directory
  -o, --output DIR          Output (crash) directory
  -b, --bin PATH            Path to the target binary
  -m, --mutate FILE         Mutation payload file
  -q, --queries FILE        JSON file containing queries (see queries/<database>/*.json)

Optional:
  -t, --auth TOKEN          Authentication token (JWT)
  -B, --bucket BUCKET_NAME  S3 bucket name for Iceberg (required if --format=iceberg)

./fuzzberg \
-i ./corpus_iceberg \
-o ./crash \
--database=firebolt \
--bucket iceberg-fuzzing \
--format=iceberg \
--url=http://localhost:3473 \
-m /data/minio/iceberg-fuzzing/metadata \
-q fb_core_iceberg.json \
-b ./firebolt-core

[INFO] Loaded 1 queries from: ./fb_core_iceberg.json
Loading seed corpus from: ./corpus_iceberg/
[+] Loaded 7 metadata files and 8 manifest list files in the corpus.
Checking connection to server...
......
......
********* Starting generic metadata fuzzing *********
Query : SELECT * FROM READ_ICEBERG(url => 's3://iceberg-fuzzing/metadata/v3.metadata.json');
{
"errors": [
{
"description": "Exception: error: 1: unterminated string literal"
}
],
"query": {
"query_id": "0b6bddc7-881a-4531-9012-5d5e5dc2cb16",
"query_label": null,
"request_id": "f222ca06-a88d-4888-bfe7-86b2764a7828"
},
"statistics": {
"elapsed": 0.0
}
}
******** Starting structured metadata fuzzing *********
Key: "current-snapshot-id", Original Value: 4676137652994606811, Mutated Value: 170141183460469231731687303715884105727
Query : SELECT * FROM READ_ICEBERG(url => 's3://iceberg-fuzzing/metadata/v3.metadata.json');
Response: {
"errors": [
{
"description": "Exception: Value too large."
}
],
"query": {
"query_id": "c1c6a6c5-c612-438d-a574-ecc563303247",
"query_label": null,
"request_id": "54bd0463-c45b-448d-82ea-efd487c95e6e"
},
"statistics": {
"elapsed": 0.0
}
}
./fuzzberg \
-i ./corpus_parquet \
-o ./crash \
--database=firebolt \
--format=parquet \
--url=http://localhost:3473 \
-m /data/minio/black-box-fuzzer/ \
-q fb_core_parquet.json \
-b ./firebolt-core

Query : SELECT * FROM READ_PARQUET(url => 's3://black-box-fuzzer/fuzz.parquet');
Response: {
"errors": [
{
"description": "Error reading column 'l_partkey' in row group 0 of 's3://black-box-fuzzer/fuzz.parquet': IOError: Corrupt snappy compressed data."
}
],
"query": {
"query_id": "63be960d-b218-41b6-afa1-dd5590d2d781",
"query_label": null,
"request_id": "b0404488-579f-41d3-b8cd-6e6f30fe2689"
},
"statistics": {
"elapsed": 0.016309347
}
}
DuckDB read_csv() (with HTTP Server Extension)
./fuzzberg \
-i ./corpus/csv \
-o ./crash \
--database=duckdb \
--format=csv \
--url=http://localhost:9999 \
-m /tmp \
-q duckdb_csv.json \
-b ./duckdb-extension-httpserver/build/release/duckdb \
-- \
--ascii \
--init /home/ubuntu/ddb/duckdb/init.sql \
--batch

[INFO] Loaded 2 queries from: duckdb_csv.json
Loading seed corpus from: ./corpus/csv/
[+] Loaded 10 files in the corpus.
Checking connection to server...
┌──────────────────────────────────────┐
│ httpserve_start('0.0.0.0', 9999, '') │
│ varchar │
├──────────────────────────────────────┤
│ HTTP server started on 0.0.0.0:9999 │
└──────────────────────────────────────┘
Query : SELECT * FROM read_csv('/tmp/fuzz.csv',header = true,delim = '|',allow_quoted_nulls = false, ignore_errors=false);
Response: {"c9223372036854775809,c2,c3,c5,c5,c6,c7,c128,c9,c10,c11,c12,c13,c14,c15":"t,2�,,,,,,�,,,,,I ,,,c4294967296,c6,c212,c8,c9,c10,c�,c12,c13"}
{"c9223372036854775809,c2,c3,c5,c5,c6,c7,c128,c9,c10,c11,c12,c13,c14,c15":"e,QrUe,10,100,-32642,-263749625369741"}
Query : SELECT * FROM read_csv('/tmp/fuzz.csv');
Response: Invalid Input Error: CSV Error on Line: 1
Invalid unicode (byte sequence mismatch) detected. This file is not utf-8 encoded.
Possible Solution: Set the correct encoding, if available, to read this CSV File (e.g., encoding='UTF-16')
....
- Increase fuzzing speed: Achieve higher execs/sec
- Coverage-guided fuzzing: Since FuzzBerg controls targets via fork/exec, targets can be instrumented to report code coverage to shared memory, allowing seed prioritization
- Expanding support: New file formats, additional targets/protocols
If you discover a bug, please report it via GitHub Issues.