fix(csv-parser): Parse CSVs with inferred schema when column names contain leading/trailing whitespace #718
base: main
Conversation
📝 Walkthrough

CSV header handling adjusted to strip whitespace during read and to keep original (post-strip) names during schema inference. A new unit test verifies that headers with surrounding whitespace are normalized and that parsed records use stripped header names.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant CSVParser
    participant CSVReader as CSV Reader
    participant DictReader as DictReader
    participant Schema as Schema Inference
    Client->>CSVParser: parse(stream)
    CSVParser->>CSVReader: read first row (headers)
    CSVReader-->>CSVParser: ["header1 ", "\theader2"]
    Note right of CSVParser: Normalize headers by stripping whitespace
    CSVParser->>DictReader: init with ["header1", "header2"]
    CSVParser->>Schema: infer using normalized headers
    DictReader-->>CSVParser: {"header1": "1", "header2": "2"}
    CSVParser-->>Client: records
```
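A minimal runnable sketch of the flow above, using only the standard library (the function name and I/O here are illustrative, not the CDK's actual internals):

```python
import csv
import io

def read_records(text: str) -> list:
    buf = io.StringIO(text)
    reader = csv.reader(buf)
    # Normalize headers by stripping surrounding whitespace, as the fix does.
    headers = [header.strip() for header in next(reader)]
    # DictReader then keys every row by the normalized header names.
    return list(csv.DictReader(buf, fieldnames=headers))

print(read_records("header1 ,\theader2\n1,2\n"))
# -> [{'header1': '1', 'header2': '2'}]
```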
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 0
🧹 Nitpick comments (2)
`airbyte_cdk/sources/file_based/file_types/csv_parser.py` (1)

Lines 131-134: Normalize header names: solid fix; consider BOM stripping and duplicate-header collisions, wdyt?

Trimming at read-time is the right place to fix the mismatch between `infer_schema` and `parse_records`. Two small robustness tweaks you might consider:

- A UTF-8 BOM on the first header cell (common in CSVs exported from Excel) can leak into the first field name; we can strip it.
- After normalization, distinct raw headers can collapse into the same key (e.g., `"id "` and `"id"`). Today, `DictReader` will silently overwrite earlier columns. Should we dedupe with a suffix, or at least detect and log this?

Proposed localized change (keeps behavior but adds BOM strip and optional de-duplication that preserves order):
```diff
-        headers = [header.strip() for header in next(reader)]
+        raw_headers = next(reader)
+        # Normalize: strip BOM from first cell and trim whitespace on all headers.
+        normalized = [
+            (h.lstrip("\ufeff") if i == 0 else h).strip()
+            for i, h in enumerate(raw_headers)
+        ]
+        # Optional: ensure uniqueness to avoid silent overwrites in DictReader.
+        # We add a numeric suffix when collisions occur: "id", "id__2", "id__3", ...
+        seen = {}
+        headers: List[str] = []
+        for h in normalized:
+            if h in seen:
+                seen[h] += 1
+                headers.append(f"{h}__{seen[h]}")
+            else:
+                seen[h] = 1
+                headers.append(h)
```

If you'd rather just detect and warn (without changing keys), we can drop the suffixing loop and emit a warning via the provided logger in the calling context, wdyt?
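For the warn-only route, a minimal sketch (the helper name is hypothetical, and it assumes a standard `logging` logger stands in for whatever logger the call site provides; the normalized names are left unchanged):

```python
import logging
from typing import List

logger = logging.getLogger(__name__)

def warn_on_duplicate_headers(raw_headers: List[str], normalized: List[str]) -> List[str]:
    # Detect names that collapsed into the same key after stripping.
    duplicates = sorted({h for h in normalized if normalized.count(h) > 1})
    if duplicates:
        # Keys are left as-is; DictReader will still overwrite earlier
        # columns, but the collision is now visible in the logs.
        logger.warning(
            "Header normalization produced duplicate column names %s "
            "(raw headers: %s); later columns will overwrite earlier ones.",
            duplicates,
            raw_headers,
        )
    return normalized
```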
I can wire a lightweight duplicate detection that logs a single warning with the original vs. normalized header list and add unit tests for both BOM stripping and duplicate handling—should I push that?
`unit_tests/sources/file_based/file_types/test_csv_parser.py` (1)

Lines 661-674: Add an end-to-end test that chains `infer_schema` → `parse_records` to prevent regressions, wdyt?

To lock in the fix, could we add a test that:

- runs `infer_schema` on a CSV with header whitespace,
- passes the resulting schema into `parse_records`,
- and asserts correctly typed, non-empty records come out?

Example (can be dropped in this test module):
```python
def test_infer_schema_then_parse_records_with_whitespace_headers():
    parser = CsvParser()
    stream_reader = Mock()
    mock_obj = stream_reader.open_file.return_value
    # Note: intentional whitespace in headers
    csv_text = "c1 ,\tc2\n1,2\n"
    mock_obj.__enter__ = Mock(return_value=io.StringIO(csv_text))
    mock_obj.__exit__ = Mock(return_value=None)
    file = RemoteFile(uri="mem://whitespace.csv", last_modified=datetime.now())
    config = FileBasedStreamConfig(
        name="t", validation_policy="Emit Record", file_type="csv", format=CsvFormat()
    )

    # infer
    loop = asyncio.get_event_loop()
    inferred = loop.run_until_complete(parser.infer_schema(config, file, stream_reader, logger))
    assert inferred == {"c1": {"type": "integer"}, "c2": {"type": "integer"}}

    # parse with discovered schema
    mock_obj.__enter__ = Mock(return_value=io.StringIO(csv_text))
    out = list(parser.parse_records(config, file, stream_reader, logger, {"properties": inferred}))
    assert out == [{"c1": 1, "c2": 2}]
```

We could also add small tests for BOM removal and duplicate-header collision behavior if we decide to implement those, wdyt?
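A BOM variant might look like the following sketch (hypothetical: it only makes sense if the `lstrip("\ufeff")` tweak above is adopted, and it mirrors the mocks from the example test):

```python
def test_parse_records_strips_bom_from_first_header():
    # Hypothetical test: assumes the BOM-stripping tweak lands.
    parser = CsvParser()
    stream_reader = Mock()
    mock_obj = stream_reader.open_file.return_value
    csv_text = "\ufeffc1,c2\n1,2\n"  # Excel-style export: BOM precedes the first header
    mock_obj.__enter__ = Mock(return_value=io.StringIO(csv_text))
    mock_obj.__exit__ = Mock(return_value=None)
    file = RemoteFile(uri="mem://bom.csv", last_modified=datetime.now())
    config = FileBasedStreamConfig(
        name="t", validation_policy="Emit Record", file_type="csv", format=CsvFormat()
    )
    schema = {"properties": {"c1": {"type": "string"}, "c2": {"type": "string"}}}
    out = list(parser.parse_records(config, file, stream_reader, logger, schema))
    assert out == [{"c1": "1", "c2": "2"}]  # key is "c1", not "\ufeffc1"
```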
Happy to add the E2E test (and BOM/duplicate variants) in this PR if that helps.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
- `airbyte_cdk/sources/file_based/file_types/csv_parser.py` (2 hunks)
- `unit_tests/sources/file_based/file_types/test_csv_parser.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
`unit_tests/sources/file_based/file_types/test_csv_parser.py` (2)

- `airbyte_cdk/sources/file_based/file_based_stream_reader.py` (1)
  - `open_file` (60-72)
- `unit_tests/sources/file_based/in_memory_files_source.py` (4)
  - `open_file` (162-172)
  - `open_file` (226-229)
  - `open_file` (249-252)
  - `open_file` (275-278)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Pytest (Fast)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (2)
`airbyte_cdk/sources/file_based/file_types/csv_parser.py` (1)

Lines 212-214: Sourcing schema keys directly from parsed rows keeps inference and parsing aligned — LGTM.

Using `header` as produced by `read_data` (which now normalizes headers) ensures `infer_schema` and `parse_records` use identical keys. This resolves the schema/row mismatch without further special-casing.

`unit_tests/sources/file_based/file_types/test_csv_parser.py` (1)

Lines 661-674: Great regression test covering whitespace around delimiters in headers.

This exercises the real-world case that previously broke the `infer_schema` + `parse_records` combo. Nice and concise.
Cannot combine `infer_schema` with `parse_records` when column names contain leading/trailing whitespace

When parsing CSV files containing whitespace around the delimiter character [1], there's an issue resulting in null rows when passing `discovered_schema`.

Diagnosis:

- `CsvParser.infer_schema` strips whitespace around header names.
- In `parse_records`, rows are indexed by the un-stripped header names, so when validating the rows with the inferred schema passed via the `discovered_schema` param, these validation lookups fail and return null values.

Effectively this means you can't combine `infer_schema` with `parse_records` when column names contain whitespace.

[1] For example, files like this:
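A minimal stand-in based on the `csv_text` used in the unit test above (`\t` marks a literal tab; the first header carries a trailing space):

```text
c1 ,\tc2
1,2
```

Before the fix, the inferred schema and the row keys then disagreed; a hand-written illustration of the failing lookups:

```python
# Schema keys were stripped during inference...
inferred_keys = ["c1", "c2"]
# ...but rows were keyed by the raw, un-stripped header names.
row = {"c1 ": "1", "\tc2": "2"}
print([row.get(key) for key in inferred_keys])  # -> [None, None]: the "null rows"
```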
Summary by CodeRabbit

- Bug Fixes: CSV header names with surrounding whitespace are now stripped on read, so schemas inferred from headers match the keys of parsed records.
- Tests: Added a unit test verifying that whitespace-padded headers are normalized and that parsed records use the stripped header names.