Preserve Complex Data Types for to_csv #61157
Conversation
@Jaspvr is there an existing issue that this PR addresses? If so, could you list it in the description? If not, please create an issue describing the bug or proposed enhancement so it can be reviewed by a team member.
Thanks for the PR! You may want to review our development documentation, namely this section:
https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#writing-tests
@@ -3858,6 +3859,11 @@ def to_csv(
        {storage_options}

        preserve_complex : bool, default False
As commented in the issue, you can use the dtype argument in read_csv to read complex values already. I'm negative on this approach.
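The reviewer's point can be sketched with current pandas. A minimal round trip for complex numbers, using converters in read_csv (converters rather than dtype is an assumption here; it applies Python's built-in complex() to each cell string):

```python
import io

import pandas as pd

# Round-trip complex numbers through CSV with existing pandas options.
df = pd.DataFrame({"z": [1 + 2j, 3 - 4j]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # complex values are written as e.g. "(1+2j)"
buf.seek(0)

# Parse the column back to complex on read; no new to_csv parameter needed.
out = pd.read_csv(buf, converters={"z": complex})
print(out["z"].iloc[0])  # (1+2j)
```

So scalar complex values already survive a CSV round trip with the existing read-side options.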
CSV's fundamental design is meant for simple, flat data structures, and embedding complex objects within it starts to deviate from that principle. JSON arrays inside a CSV file may technically work, but it introduces challenges in readability, compatibility, and usability in spreadsheet software like Excel or Google Sheets.
With nearly 50 arguments, adding more requires careful consideration to avoid unnecessary complexity. Pandas is widely used in data processing, and every additional feature should be weighed against maintainability and ease of use.
Does it align with the expectations of CSV as a format that is universally readable? Is there significant demand for preserving complex structures within CSV, or are existing solutions (e.g., JSON, Parquet) sufficient?
It seems more reasonable to encourage users to store complex data types in formats like Parquet or JSON rather than forcing CSV to handle something it's not naturally designed for.
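As a quick illustration of that suggestion, a format like JSON already preserves nested, array-like values without any custom encoding (a minimal sketch; the column name "vec" is illustrative):

```python
import io

import pandas as pd

# Nested list data survives a JSON round trip as-is,
# unlike CSV, which flattens everything to strings.
df = pd.DataFrame({"vec": [[1, 2, 3], [4, 5, 6]]})

s = df.to_json(orient="records")
back = pd.read_json(io.StringIO(s), orient="records")
print(back["vec"].iloc[0])  # [1, 2, 3]
```

Parquet (via to_parquet/read_parquet with a pyarrow backend) similarly preserves nested types natively.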
This PR introduces a useful feature for preserving complex data types like NumPy arrays during CSV serialization/deserialization via the preserve_complex parameter. Two minor points for consideration: the serialization logic (…)
This Pull Request solves the issue outlined in:
#60895
Complex data types like NumPy arrays can now be stored in CSV format and later recovered.
A new parameter, preserve_complex, is introduced; when it is set to True in your to_csv call, complex data types are preserved and can be recovered from the CSV.
This works by serializing NumPy arrays into JSON strings when preserve_complex=True. To read them back, pass the same parameter to read_csv, and the original NumPy arrays will be returned.
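The mechanism described above can be approximated in current pandas without the new parameter; this hedged sketch mimics what preserve_complex=True would do (the column name "arr" and the helper lambdas are illustrative, not part of the PR):

```python
import io
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"arr": [np.array([1, 2]), np.array([3, 4])]})

# Writing: serialize each array to a JSON string so it fits in one CSV cell
# (this mimics what preserve_complex=True is described as doing).
out = df.assign(arr=df["arr"].map(lambda a: json.dumps(a.tolist())))
buf = io.StringIO()
out.to_csv(buf, index=False)

# Reading: decode the JSON strings back into NumPy arrays.
buf.seek(0)
restored = pd.read_csv(buf)
restored["arr"] = restored["arr"].map(lambda s: np.array(json.loads(s)))
print(restored["arr"].iloc[0])  # [1 2]
```

The proposed parameter would fold this encode/decode step into to_csv and read_csv themselves.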
Please refer to tests in scripts/tests/test_csv.py to see how this is used.
Please refer to the original issue for more information on the problem definition.