
Conversation

@eferm commented Aug 21, 2025

When parsing CSV files that contain whitespace around the delimiter character [1], passing discovered_schema results in rows of null values.

Diagnosis:

  • The schema returned by CsvParser.infer_schema strips whitespace around header names
  • When reading CSV data with parse_records, rows are keyed by the un-stripped header names. Validation lookups against the inferred schema passed via the discovered_schema param therefore fail and return null values

Effectively, this means you can't combine infer_schema with parse_records when column names contain whitespace.

[1] For example, files like this:

header1 ,\theader2
value1,value2
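
A minimal standalone repro of the lookup mismatch, using only the standard csv module (illustrative sketch, not the CDK code itself):

import csv
import io

# csv.DictReader keys rows by the raw header cells, whitespace included.
data = io.StringIO("header1 ,\theader2\nvalue1,value2\n")
row = next(csv.DictReader(data))
print(list(row))  # ['header1 ', '\theader2']

# A schema keyed on the stripped names misses every lookup against the row.
stripped_keys = ["header1", "header2"]
print([row.get(key) for key in stripped_keys])  # [None, None]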

Summary by CodeRabbit

  • Bug Fixes

    • CSV parsing now trims leading/trailing whitespace from header names, ensuring consistent field names in parsed records.
    • Schema inference aligns with the parsed headers to prevent mismatches between inferred schema keys and actual data field names.
  • Tests

    • Added a unit test to verify that headers with surrounding whitespace are correctly normalized and data maps to the expected field names.

coderabbitai bot commented Aug 21, 2025

📝 Walkthrough

CSV header handling was adjusted to strip whitespace when reading and to reuse the stripped names during schema inference. A new unit test verifies that headers with surrounding whitespace are normalized and that parsed records use the stripped header names.

Changes

  • CSV parser header normalization (airbyte_cdk/.../file_types/csv_parser.py): strip leading/trailing whitespace from header values when reading; align schema inference to use the normalized headers without additional stripping.
  • Unit test for header stripping (unit_tests/.../file_types/test_csv_parser.py): add a test ensuring headers like "header1 ,\theader2" are normalized to "header1","header2" and that data maps accordingly.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Client
    participant CSVParser
    participant CSVReader as CSV Reader
    participant DictReader as DictReader
    participant Schema as Schema Inference

    Client->>CSVParser: parse(stream)
    CSVParser->>CSVReader: read first row (headers)
    CSVReader-->>CSVParser: ["header1 ", "\theader2"]
    Note right of CSVParser: Normalize headers by stripping whitespace
    CSVParser->>DictReader: init with ["header1","header2"]
    CSVParser->>Schema: infer using normalized headers
    DictReader-->>CSVParser: {"header1":"1","header2":"2"}
    CSVParser-->>Client: records

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


@coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)

131-134: Normalize header names: solid fix; consider BOM stripping and duplicate-header collisions, wdyt?

Trimming at read-time is the right place to fix the mismatch between infer_schema and parse_records. Two small robustness tweaks you might consider:

  • UTF‑8 BOM on first header cell (common in CSVs exported from Excel) can leak into the first field name; we can strip it.
  • After normalization, distinct raw headers can collapse into the same key (e.g., "id " and "id"). Today, DictReader will silently overwrite earlier columns. Should we dedupe with a suffix or at least detect and log this?

Proposed localized change (keeps behavior but adds BOM strip and optional de-duplication that preserves order):

-            headers = [header.strip() for header in next(reader)]
+            raw_headers = next(reader)
+            # Normalize: strip BOM from first cell and trim whitespace on all headers.
+            normalized = [
+                (h.lstrip("\ufeff") if i == 0 else h).strip()
+                for i, h in enumerate(raw_headers)
+            ]
+            # Optional: ensure uniqueness to avoid silent overwrites in DictReader.
+            # We add a numeric suffix when collisions occur: "id", "id__2", "id__3", ...
+            seen = {}
+            headers: List[str] = []
+            for h in normalized:
+                if h in seen:
+                    seen[h] += 1
+                    headers.append(f"{h}__{seen[h]}")
+                else:
+                    seen[h] = 1
+                    headers.append(h)

If you’d rather just detect and warn (without changing keys), we can drop the suffixing loop and emit a warning via the provided logger in the calling context, wdyt?
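
A minimal warn-only sketch (hypothetical; assumes the normalized list from the diff above and a logger reachable in the calling context):

from collections import Counter

# Hypothetical warn-only variant: keep the normalized names unchanged and
# surface collisions instead of renaming them.
duplicates = sorted(h for h, n in Counter(normalized).items() if n > 1)
if duplicates:
    logger.warning(
        "Duplicate CSV headers after whitespace normalization: %s; "
        "csv.DictReader will keep only the value of the last duplicate column.",
        duplicates,
    )
headers = normalized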

I can wire a lightweight duplicate detection that logs a single warning with the original vs. normalized header list and add unit tests for both BOM stripping and duplicate handling—should I push that?

unit_tests/sources/file_based/file_types/test_csv_parser.py (1)

661-674: Add an end-to-end test that chains infer_schema → parse_records to prevent regressions, wdyt?

To lock in the fix, could we add a test that:

  • runs infer_schema on a CSV with header whitespace,
  • passes the resulting schema into parse_records,
  • and asserts correctly typed, non-empty records come out?

Example (can be dropped in this test module):

def test_infer_schema_then_parse_records_with_whitespace_headers():
    parser = CsvParser()
    stream_reader = Mock()
    mock_obj = stream_reader.open_file.return_value
    # Note: intentional whitespace in headers
    csv_text = "c1 ,\tc2\n1,2\n"
    mock_obj.__enter__ = Mock(return_value=io.StringIO(csv_text))
    mock_obj.__exit__ = Mock(return_value=None)
    file = RemoteFile(uri="mem://whitespace.csv", last_modified=datetime.now())
    config = FileBasedStreamConfig(name="t", validation_policy="Emit Record", file_type="csv", format=CsvFormat())

    # infer
    inferred = asyncio.run(parser.infer_schema(config, file, stream_reader, logger))
    assert inferred == {"c1": {"type": "integer"}, "c2": {"type": "integer"}}

    # parse with discovered schema
    mock_obj.__enter__ = Mock(return_value=io.StringIO(csv_text))
    out = list(parser.parse_records(config, file, stream_reader, logger, {"properties": inferred}))
    assert out == [{"c1": 1, "c2": 2}]

We could also add small tests for BOM removal and duplicate-header collision behavior if we decide to implement those, wdyt?
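
For instance, the BOM case might look like this (hypothetical sketch; it reuses the mocking pattern above, assumes the BOM-stripping change from the csv_parser.py comment is adopted, and follows the example's call signatures rather than a confirmed API):

def test_parse_records_strips_bom_from_first_header():
    parser = CsvParser()
    stream_reader = Mock()
    mock_obj = stream_reader.open_file.return_value
    csv_text = "\ufeffc1,c2\n1,2\n"  # BOM glued to the first header cell
    mock_obj.__enter__ = Mock(return_value=io.StringIO(csv_text))
    mock_obj.__exit__ = Mock(return_value=None)
    file = RemoteFile(uri="mem://bom.csv", last_modified=datetime.now())
    config = FileBasedStreamConfig(name="t", validation_policy="Emit Record", file_type="csv", format=CsvFormat())

    records = list(parser.parse_records(config, file, stream_reader, logger, None))
    assert list(records[0]) == ["c1", "c2"]  # no stray "\ufeffc1" key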

Happy to add the E2E test (and BOM/duplicate variants) in this PR if that helps.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between cd48741 and 5f6b9a8.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/file_based/file_types/csv_parser.py (2 hunks)
  • unit_tests/sources/file_based/file_types/test_csv_parser.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
unit_tests/sources/file_based/file_types/test_csv_parser.py (2)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
  • open_file (60-72)
unit_tests/sources/file_based/in_memory_files_source.py (4)
  • open_file (162-172)
  • open_file (226-229)
  • open_file (249-252)
  • open_file (275-278)
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
🔇 Additional comments (2)
airbyte_cdk/sources/file_based/file_types/csv_parser.py (1)

212-214: Sourcing schema keys directly from parsed rows keeps inference and parsing aligned — LGTM.

Using headers as produced by read_data (which now normalizes them) ensures infer_schema and parse_records use identical keys. This resolves the schema/row mismatch without further special-casing.

unit_tests/sources/file_based/file_types/test_csv_parser.py (1)

661-674: Great regression test covering whitespace around delimiters in headers.

This exercises the real-world case that previously broke the infer_schema + parse_records combo. Nice and concise.

@eferm eferm changed the title Fix bug parsing CSVs combining infer_schema with parse_records when column names contain leading/trailing whitespace fix(csv-parser): Parse CSVs with inferred schema when column names contain leading/trailing whitespace Aug 22, 2025