[FEATURE] Apache Arrow File Ingestion Support #235

Description

@KaifAhmad1

Problem Statement

Apache Arrow is a columnar in-memory format for high-performance data processing, but Semantica's file ingestion has no dedicated support for parsing Arrow files. Adding it would let Semantica ingest local Arrow files directly, with no credentials required.

Why This Is Necessary for Semantica: Arrow is designed for zero-copy reads: a file can be memory-mapped and its columns used without first deserializing them. Supporting Arrow ingestion lets Semantica process columnar data efficiently.

Current Status: Arrow file parsing is not implemented. Contributions are welcome!

Features

Arrow File Reading: Read Arrow files and extract their data efficiently, using zero-copy reads where possible

Schema Extraction: Extract the Arrow schema, column types, and schema metadata

Batch Processing: Process Arrow record batches one at a time and handle streaming Arrow files

Memory Efficiency: Leverage Arrow's zero-copy capabilities to keep memory usage low

Metadata Extraction: Extract file metadata, batch information, and schema details

Files

Enhance semantica/ingest/file_ingestor.py or create semantica/ingest/arrow_ingestor.py:

  • ArrowIngestor - Arrow file ingestion class
  • Integration with existing file ingestion

Getting Started

Current State: Arrow file parsing is not implemented yet, so this is a new feature opportunity!

Reference Patterns: semantica/ingest/file_ingestor.py for file patterns

Libraries: pyarrow for reading Arrow files

Testing: No credentials required - use local Arrow files for testing!

Projects

Status
In progress
