Neythaleon is an efficient, dependency-light data ingestion and observability toolkit for marine biodiversity datasets. Originally built for OBIS datasets, it ingests .parquet files, performs a robust cleaning pipeline, streams the data to PostgreSQL, and tracks detailed system metrics.
"The Eye Below Logs Everything."
- Efficient Ingestion: Ingests large `.parquet` files in memory-safe chunks using DuckDB.
- Robust ETL: A multi-step cleaning and transformation pipeline handles data types, special characters, and complex geometry formats.
- High-Performance Loading: Uses PostgreSQL's native `COPY` command for fast, bulk data insertion.
- Comprehensive Observability: Logs detailed performance metrics, including CPU, RAM, throughput, and processing time per batch.
- Schema Safe: Automatically synchronizes the data's structure with the database table schema to prevent load failures.
- Automated Setup: Can automatically create the target database table if it doesn't exist.
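The schema-synchronization feature could look something like the following minimal pandas sketch (the helper name `align_to_table` and its behavior are illustrative; the repository's actual implementation lives in `scheme_utils.py` as `coerce_df_to_schema`):

```python
import pandas as pd

def align_to_table(df: pd.DataFrame, table_columns: list) -> pd.DataFrame:
    """Align a chunk's columns to the target table:
    drop unknown columns, add missing ones as NULL, and match column order."""
    aligned = df.loc[:, [c for c in df.columns if c in table_columns]].copy()
    for col in table_columns:
        if col not in aligned.columns:
            aligned[col] = pd.NA  # missing in this chunk -> NULL in the database
    return aligned[table_columns]
```

Keeping the column set and order identical to the table is what lets a raw `COPY` stream succeed without per-row column mapping.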
- Extract: Read `.parquet` files in memory-efficient chunks using DuckDB.
- Transform: A pipeline of cleaning functions is applied to each chunk:
  - Drop sparse columns and rows without valid coordinates.
  - Enforce correct integer data types.
  - Sanitize text fields by removing special characters.
  - Convert binary geometry (WKB) to text (WKT).
- Align: The schema of the cleaned data is dynamically matched to the target database table's schema.
- Load: The prepared data is bulk-loaded into PostgreSQL via an in-memory TSV stream.
- Log:
  - Time taken, CPU %, and RAM % are recorded for each chunk.
  - Rows ingested and processing speed (rows/sec) are calculated.
  - Metrics are exported to `metrics_log.csv` and to human-readable logs.
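The transform stage above could be sketched roughly as follows (column names, the sparsity threshold, and the function name are illustrative; the real implementations live in `transform.py`):

```python
import pandas as pd

def clean_chunk(df: pd.DataFrame,
                lat_col: str = "decimalLatitude",
                lon_col: str = "decimalLongitude",
                sparse_threshold: float = 0.95) -> pd.DataFrame:
    """One pass of the per-chunk cleaning pipeline."""
    # Drop columns that are almost entirely empty.
    df = df.loc[:, df.isna().mean() < sparse_threshold]
    # Keep only rows with plausible coordinates (when the columns exist).
    if lat_col in df.columns and lon_col in df.columns:
        df = df[df[lat_col].between(-90, 90) & df[lon_col].between(-180, 180)]
    # Strip null bytes and other control characters from text fields.
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].str.replace(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", regex=True)
    return df
```

Null-byte removal matters specifically because PostgreSQL rejects `\x00` inside text values during `COPY`.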
These dashboards are generated from the `metrics_log.csv` file using `plot.py`, providing a clear view of the pipeline's performance across different datasets and schemas.
The baseline run on biodiversity data shows a highly stable throughput of over 11,000 rows/sec with very low CPU usage, indicating an I/O-bound and efficient process.
This test on the FHV dataset demonstrates the pipeline's consistency, achieving a stable throughput of ~9,000 rows/sec with remarkably uniform batch processing times.
This run on the larger NYC Yellow Taxi dataset shows the pipeline's ramp-up behavior, peaking at ~2,700 rows/sec and highlighting a positive correlation between CPU usage and throughput.
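The low CPU usage in these runs is consistent with the `COPY`-based load doing most of the heavy lifting inside PostgreSQL. That load step could be sketched as follows (hypothetical helper names; assumes a psycopg2 connection, whose `copy_expert` streams a file-like object into `COPY ... FROM STDIN`):

```python
import io
import pandas as pd

def chunk_to_tsv(df: pd.DataFrame) -> io.StringIO:
    """Serialize a DataFrame chunk to an in-memory TSV stream suitable for COPY."""
    buf = io.StringIO()
    df.to_csv(buf, sep="\t", header=False, index=False, na_rep="\\N")
    buf.seek(0)
    return buf

def copy_insert(conn, df: pd.DataFrame, table: str) -> None:
    """Bulk-load one chunk via PostgreSQL's COPY (conn is a psycopg2 connection)."""
    cols = ", ".join(df.columns)
    sql = f"COPY {table} ({cols}) FROM STDIN WITH (FORMAT text, NULL '\\N')"
    with conn.cursor() as cur:
        cur.copy_expert(sql, chunk_to_tsv(df))
    conn.commit()
```

Streaming an in-memory TSV avoids both temporary files and per-row `INSERT` round-trips, which is typically where the order-of-magnitude throughput gain over `executemany` comes from.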
```
Neythaleon/                    # repo root
├── .env.example               # runtime config (DATABASE_URL, PARQUET_DIR, etc.)
├── main.py                    # top-level runner (calls ingest CLI then plot CLI)
├── requirements.txt           # pinned/loose deps
├── README.md                  # (optional) project overview & usage
│
├── ingest/                    # ingestion package
│   ├── __init__.py            # exports for ingest package
│   ├── cli.py                 # CLI entrypoint for ingestion
│   ├── config.py              # env loading and typed config values
│   ├── logging_config.py      # centralized logging setup
│   ├── ingest_runner.py       # main orchestration loop (stream -> transform -> insert)
│   ├── db_utils.py            # get_table_columns / copy_insert / failed chunk save
│   ├── parquet_utils.py       # stream_parquet_chunks, get_full_schema_from_parquet
│   ├── transform.py           # clean_null_bytes, enforce_integer_types, convert_geometry_to_wkt, process_chunk
│   ├── scheme_utils.py        # coerce_df_to_schema (DB-type based coercion)
│   └── metrics.py             # track_metrics + persist_metrics (CSV)
│
├── plot/                      # plotting package
│   ├── __init__.py            # exports for plot package
│   ├── cli.py                 # CLI entrypoint for plotting (reads media/metrics CSV)
│   ├── dashboard.py           # create_single_png (layout + save)
│   ├── metrics_loader.py      # fuzzy CSV reading + numeric coercion
│   └── plotting_panels.py     # plot_throughput, plot_cpu_vs_throughput, plot_memory, plot_batch_time
│
├── parquet/                   # (input) place your .parquet files here (PARQUET_DIR)
│   └── *.parquet
│
├── media/                     # output directory for metrics & saved PNGs (METRICS_FILE lives here)
│   ├── ingestion_metrics.csv
│   └── metrics_dashboard.png  # multiple PNGs for different runs are presented in this markdown
│
└── failed_chunks/             # saved CSVs when an insert/processing chunk fails
    └── failed_chunk_<id>_<ts>.csv
```
- Python 3.8+
- DuckDB
- Pandas
- SQLAlchemy
- Shapely
- Psycopg2 (or other DB driver)
- A PostgreSQL-compatible database
```
pip install -r requirements.txt
```

The pipeline is configured using a `.env` file. To get started, copy the provided template to a new file named `.env`, then edit the values to match your dataset and database credentials.
You can do this in your terminal with the following command:
```
cp .env.example .env
```

Now, open the newly created `.env` file and edit the variables.
- `DATABASE_URL`: The full connection string for your PostgreSQL database.
- `DB_TABLE`: The name of the table where the data will be ingested. If the table doesn't exist, the script will create it based on the schema of the first data chunk.
  Example: `DB_TABLE="nyc_taxi_trips"`
- `PARQUET_DIR`: The path to the folder containing your `.parquet` files.
  Example: `PARQUET_DIR="fhv_data"`
- `LAT_COLUMN` & `LON_COLUMN`: The exact column names for latitude and longitude in your dataset. These are only used if `VALIDATE_COORDS` is enabled.
  Example (for old taxi data): `LAT_COLUMN="pickup_latitude"`
- `VALIDATE_COORDS`: Set to `true` to clean data based on coordinate columns. Set to `false` if your dataset does not have latitude/longitude columns (like the recent NYC TLC data).
  Example: `VALIDATE_COORDS="false"`
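Inside the pipeline, `config.py` presumably turns these environment variables into typed values; a minimal sketch (the default values here are illustrative, and in the real project a loader such as python-dotenv would read the `.env` file into the environment first):

```python
import os

def load_config() -> dict:
    """Read pipeline settings from the environment.
    Defaults below are illustrative, not the project's actual defaults."""
    return {
        "database_url": os.environ["DATABASE_URL"],          # required
        "db_table": os.environ.get("DB_TABLE", "ingested_data"),
        "parquet_dir": os.environ.get("PARQUET_DIR", "parquet"),
        "lat_column": os.environ.get("LAT_COLUMN", "decimalLatitude"),
        "lon_column": os.environ.get("LON_COLUMN", "decimalLongitude"),
        # Env values are strings, so the boolean flag needs explicit parsing.
        "validate_coords": os.environ.get("VALIDATE_COORDS", "true").lower() == "true",
    }
```

Parsing `VALIDATE_COORDS` explicitly matters because any non-empty string (including `"false"`) is truthy in Python.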
This project was originally designed for records from OBIS - the Ocean Biodiversity Information System. It is built to handle the raw complexity of this data, performing the necessary preprocessing to make it database-ready.
The pipeline's flexibility has been validated against other public datasets, including:
- NYC TLC Trip Records: Data for Yellow Taxis and For-Hire Vehicles (FHV) tests the script's ability to handle different schemas, particularly those without the coordinate columns used for geospatial validation.
- GBIF & USGS: Other biodiversity and scientific datasets used to confirm the coordinate validation and data cleaning steps.
New York City TLC Data
New York City Taxi and Limousine Commission. (2025). Yellow Taxi Trip Records and For-Hire Vehicle Trip Records. Retrieved August 18, 2025, from nyc.gov
Ocean Biodiversity Information System Data
OBIS (2025). Global distribution records from the OBIS database. Ocean Biodiversity Information System. Intergovernmental Oceanographic Commission of UNESCO. Available at: OBIS.
Building observable and reproducible data pipelines is fundamental to reliable data science. This project serves as a real-world example of analyzing and optimizing a Python ETL script, demonstrating how to identify and resolve complex performance bottlenecks through iterative testing and analysis.
- Code: MIT
- Data: CC0 1.0
"The sea, once it casts its spell, holds one in its net of wonder forever." β Jacques Cousteau