Preserve Complex Data Types for to_csv #61157
Conversation
@Jaspvr is there an existing issue that this PR addresses? If so, could you list it in the description? If not, please create an issue describing the bug or proposed enhancement so it can be reviewed by a team member.
Thanks for the PR! You may want to review our development documentation, namely this section:
https://pandas.pydata.org/pandas-docs/dev/development/contributing_codebase.html#writing-tests
@@ -3858,6 +3859,11 @@ def to_csv(
        {storage_options}

        preserve_complex : bool, default False
As commented in the issue, you can use the dtype argument in read_csv to read complex values already. I'm negative on this approach.
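The reviewer's point can be sketched with current pandas. A minimal round trip for complex numbers, using converters in read_csv (converters rather than dtype is an assumption here; it applies Python's built-in complex() to each cell string):

```python
import io

import pandas as pd

# Round-trip complex numbers through CSV with existing pandas options.
df = pd.DataFrame({"z": [1 + 2j, 3 - 4j]})

buf = io.StringIO()
df.to_csv(buf, index=False)  # complex values are written as e.g. "(1+2j)"
buf.seek(0)

# Parse the column back to complex on read; no new to_csv parameter needed.
out = pd.read_csv(buf, converters={"z": complex})
print(out["z"].iloc[0])  # (1+2j)
```

So scalar complex values already survive a CSV round trip with the existing read-side options.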
CSV's fundamental design is meant for simple, flat data structures, and embedding complex objects within it starts to deviate from that principle. JSON arrays inside a CSV file may technically work, but it introduces challenges in readability, compatibility, and usability in spreadsheet software like Excel or Google Sheets.
With nearly 50 arguments, adding more requires careful consideration to avoid unnecessary complexity. Pandas is widely used in data processing, and every additional feature should be weighed against maintainability and ease of use.
Does it align with the expectations of CSV as a format that is universally readable? Is there significant demand for preserving complex structures within CSV, or are existing solutions (e.g., JSON, Parquet) sufficient?
It seems more reasonable to encourage users to store complex data types in formats like Parquet or JSON rather than forcing CSV to handle something it's not naturally designed for.
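As a quick illustration of that suggestion, a format like JSON already preserves nested, array-like values without any custom encoding (a minimal sketch; the column name "vec" is illustrative):

```python
import io

import pandas as pd

# Nested list data survives a JSON round trip as-is,
# unlike CSV, which flattens everything to strings.
df = pd.DataFrame({"vec": [[1, 2, 3], [4, 5, 6]]})

s = df.to_json(orient="records")
back = pd.read_json(io.StringIO(s), orient="records")
print(back["vec"].iloc[0])  # [1, 2, 3]
```

Parquet (via to_parquet/read_parquet with a pyarrow backend) similarly preserves nested types natively.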
This PR introduces a useful feature for preserving complex data types like NumPy arrays during CSV serialization/deserialization via the preserve_complex parameter. Two minor points for consideration: the serialization logic (…)
This Pull Request solves the issue outlined in:
#60895
Complex data types like NumPy arrays can now be stored in CSV format and later recovered.
A new parameter, preserve_complex, is introduced; when it is set to True in your to_csv call, complex data types are preserved and can be recovered from the CSV.
This works by serializing NumPy arrays into JSON strings when preserve_complex=True. To read them back, pass the same parameter to read_csv, and the original NumPy arrays will be returned.
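The mechanism described above can be approximated in current pandas without the new parameter; this hedged sketch mimics what preserve_complex=True would do (the column name "arr" and the helper lambdas are illustrative, not part of the PR):

```python
import io
import json

import numpy as np
import pandas as pd

df = pd.DataFrame({"arr": [np.array([1, 2]), np.array([3, 4])]})

# Writing: serialize each array to a JSON string so it fits in one CSV cell
# (this mimics what preserve_complex=True is described as doing).
out = df.assign(arr=df["arr"].map(lambda a: json.dumps(a.tolist())))
buf = io.StringIO()
out.to_csv(buf, index=False)

# Reading: decode the JSON strings back into NumPy arrays.
buf.seek(0)
restored = pd.read_csv(buf)
restored["arr"] = restored["arr"].map(lambda s: np.array(json.loads(s)))
print(restored["arr"].iloc[0])  # [1 2]
```

The proposed parameter would fold this encode/decode step into to_csv and read_csv themselves.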
Please refer to tests in scripts/tests/test_csv.py to see how this is used.
Please refer to the original issue for more information on the problem definition.