Partial fix for #1078 — [Add Dataframe display config] #1086

Closed
wants to merge 52 commits into from

Conversation

kosiew
Contributor

@kosiew kosiew commented Mar 28, 2025

Which issue does this PR close?

Closes #1078


Rationale for this change

This PR introduces a customizable display configuration for DataFrames in the Python DataFusion API. Users often need more control over how large datasets are rendered in terminals or notebooks. This feature enhances usability by allowing control over byte limits, row limits, and cell formatting during display.

This makes it easier to work with large or verbose data interactively, improving developer experience and making DataFusion more notebook-friendly.


What changes are included in this PR?

  • Introduced a new DataframeDisplayConfig class in Python for customizing DataFrame display settings:
    • max_table_bytes
    • min_table_rows
    • max_cell_length
    • max_table_rows_in_repr
  • Added a with_display_config() method to SessionContext for setting global display options.
  • Extended Rust bindings to support and pass display configuration to the PyDataFrame.
  • Updated HTML and string rendering methods (_repr_html_, __repr__) to use the provided configuration.
  • Added tests covering:
    • Validation of config parameters
    • Impact on HTML and text representations
    • Truncation behavior for large data and cells
    • Byte-size display limits and expansion UI
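
To make the four settings concrete, here is a minimal sketch of how limits like these could be applied when rendering rows as text. It is plain Python, not the Rust code in this PR; `truncate_cell` and `render_rows` are hypothetical helpers, and the way the byte budget interacts with the minimum row count is an assumption rather than the PR's confirmed behavior.

```python
def truncate_cell(value: str, max_cell_length: int) -> str:
    """Keep the first max_cell_length characters and append '...' when the cell is cut."""
    if len(value) <= max_cell_length:
        return value
    return value[:max_cell_length] + "..."


def render_rows(rows, max_table_bytes, min_table_rows, max_cell_length,
                max_table_rows_in_repr):
    """Render rows as text lines, stopping at the row cap or once the byte budget is spent.

    Assumed semantics: at least min_table_rows rows are emitted even if they
    exceed max_table_bytes, and never more than max_table_rows_in_repr rows.
    """
    lines = []
    used_bytes = 0
    for row in rows[:max_table_rows_in_repr]:
        line = " | ".join(truncate_cell(str(cell), max_cell_length) for cell in row)
        line_bytes = len(line.encode("utf-8"))
        # Only enforce the byte budget once the minimum row count has been met.
        if len(lines) >= min_table_rows and used_bytes + line_bytes > max_table_bytes:
            break
        lines.append(line)
        used_bytes += line_bytes
    return lines
```

With a small byte budget the output stops early but still honors the minimum row count; with a large budget the row cap is the only limit.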

Are there any user-facing changes?

Yes

  • Users can now customize how DataFrames are rendered:

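    from datafusion import SessionContext

    # Display options are set once on the context and apply to its DataFrames.
    ctx = SessionContext().with_display_config(
        max_table_bytes=1024,        # byte limit for the rendered table data
        min_table_rows=10,           # minimum number of rows to display
        max_cell_length=20,          # maximum length of a displayed cell
        max_table_rows_in_repr=20,   # maximum number of rows in the repr output
    )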

kosiew added 12 commits March 28, 2025 12:45
- Introduced DisplayConfig struct to manage display settings such as max_table_bytes, min_table_rows, and max_cell_length.
- Updated PyDataFrame to utilize DisplayConfig for rendering and displaying DataFrames.
- Added methods to configure and reset display settings, allowing users to customize their DataFrame presentation in Python.
- Added DisplayConfig struct for configuring DataFrame display in Python.
- Introduced fields: max_table_bytes, min_table_rows, and max_cell_length with default values.
- Implemented a constructor for DisplayConfig to allow optional customization.
- Updated display_config method in PyDataFrame to return a Python object of DisplayConfig.
- Introduced `configure_display` method to set customizable display options for DataFrame representation, including maximum bytes, minimum rows, and maximum cell length.
- Added `reset_display_config` method to restore default display settings.
- Implemented `display_config` property to retrieve current display configuration.
- Implemented tests for accessing and modifying display configuration properties in the DataFrame class.
- Added `test_display_config` to verify default values of display settings.
- Created `test_configure_display` to test setting and partially updating display configuration.
- Introduced `test_reset_display_config` to ensure resetting configuration restores default values.
- Added validation to ensure max_table_bytes, min_table_rows, and max_cell_length are greater than 0 in the configure_display method of DataFrame class.
- Updated test cases to cover scenarios for zero and negative values, ensuring proper error handling.
- Enhanced existing tests to validate extreme values and confirm expected behavior for display configurations.
- Updated DataFrame class to include max_table_rows_in_repr parameter for display configuration.
- Enhanced configure_display method to accept max_table_rows_in_repr.
- Modified DisplayConfig struct to include max_table_rows_in_repr with a default value of 10.
- Added tests to verify the functionality of max_table_rows_in_repr in both configuration and display output.
@kosiew kosiew force-pushed the dataframe-display-config branch from 3457121 to cae89b0 Compare March 28, 2025 09:08
Contributor

@timsaucer timsaucer left a comment

I love this, but I don't think we want users to have to set it on a per-dataframe basis.

- Enhanced the DataFrame class to set display configuration at the session context level, ensuring that changes to one DataFrame's display settings affect all DataFrames created from the same context.
- Modified the PyDataFrame struct to accept a display configuration during initialization and updated methods to reference the new display_config field instead of the previous config field.
- Added tests to verify that display configurations are shared across DataFrames in the same context and remain independent across different contexts.
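
The sharing behavior described in the commit above (one configuration per context, visible to every DataFrame created from it) can be illustrated with a small Python sketch. `ContextSketch` and `FrameSketch` are hypothetical stand-ins, not classes from this PR, and the dictionary-based config and its default value are assumptions made for brevity.

```python
class FrameSketch:
    """Hypothetical DataFrame stand-in that keeps a reference to its context's config."""

    def __init__(self, display_config):
        # Store the same object, not a copy, so changes are visible to all frames.
        self.display_config = display_config


class ContextSketch:
    """Hypothetical session-context stand-in that owns one display config."""

    def __init__(self):
        # Placeholder default; the real default values are not taken from the PR.
        self.display_config = {"max_cell_length": 25}

    def create_dataframe(self):
        return FrameSketch(self.display_config)


ctx_a = ContextSketch()
ctx_b = ContextSketch()
df1, df2 = ctx_a.create_dataframe(), ctx_a.create_dataframe()
df3 = ctx_b.create_dataframe()

# A change made through one frame is seen by its sibling from the same context...
df1.display_config["max_cell_length"] = 10
# ...while a frame from a different context keeps its own settings.
```

In Rust the equivalent would need some form of shared ownership rather than a plain reference; that detail is outside this sketch.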
@kosiew
Contributor Author

kosiew commented Mar 31, 2025

@timsaucer,

Thanks for reviewing this.
Do you mean the display config should be set at the context level instead?

kosiew added 11 commits March 31, 2025 11:09
- Removed unnecessary cloning of DataFrame in various methods to enhance performance.
- Consolidated display configuration handling by removing the DisplayConfig struct and related methods.
- Updated methods to use direct references to DataFrame where applicable.
- Improved the implementation of select, filter, with_column, and other methods to work with mutable references.
- Added a new to_string method for better string representation of DataFrame.
- Cleaned up unused imports and commented-out code for better readability.
…ptions

- Introduced `DataframeDisplayConfig` struct to manage display settings for DataFrames.
- Added fields for maximum bytes, minimum rows, maximum cell length, and maximum rows in repr.
- Implemented a constructor with default values for easy initialization.
- Updated `PySessionConfig` to include `display_config` with default settings.
…fig (python)

- Introduced `with_dataframe_display_config` method in `SessionConfig` to allow customization of DataFrame display settings.
- Parameters include `max_table_bytes`, `min_table_rows`, `max_cell_length`, and `max_table_rows_in_repr` for flexible display configurations.
- Utilizes `DataframeDisplayConfig` for internal management of display settings.
…play options

- Introduced DataframeDisplayConfig to manage display settings for DataFrames.
- Added properties for max_table_bytes, min_table_rows, max_cell_length, and max_table_rows_in_repr.
- Each property includes getter and setter methods for easy configuration.
- Default values provided for each parameter to enhance usability.
- Updated `PyDataFrame` constructor to accept a `PyDataframeDisplayConfig` parameter for improved DataFrame display customization.
- Modified multiple methods in `PySessionContext` to pass the display configuration when creating `PyDataFrame` instances, ensuring consistent display settings across different DataFrame operations.
kosiew added 10 commits April 2, 2025 18:18
…eDisplayConfig

- Added a private method `_validate_positive` to encapsulate the logic for validating positive integer values.
- Updated setters for `max_table_bytes`, `min_table_rows`, `max_cell_length`, and `max_table_rows_in_repr` to use the new validation method, improving code readability and maintainability.
…lidation

- Added validation for max_table_bytes, min_table_rows, max_cell_length, and max_table_rows_in_repr to ensure positive values during initialization.
- Removed the deprecated with_dataframe_display_config method to streamline the configuration process.
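
The validation pattern described in these commits, a single helper shared by the setters and by initialization, might look roughly like the following. This is a sketch only: `DisplayConfigSketch` is a hypothetical class, it shows two of the four settings (the other two would follow the same pattern), and the default values are illustrative placeholders rather than the PR's defaults.

```python
class DisplayConfigSketch:
    """Hypothetical config class; not the PR's DataframeDisplayConfig."""

    def __init__(self, max_table_bytes=2 * 1024 * 1024, max_cell_length=25):
        # Assigning through the properties validates at construction time too.
        self.max_table_bytes = max_table_bytes
        self.max_cell_length = max_cell_length

    @staticmethod
    def _validate_positive(value, name):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"{name} must be a positive integer, got {value!r}")
        return value

    @property
    def max_table_bytes(self):
        return self._max_table_bytes

    @max_table_bytes.setter
    def max_table_bytes(self, value):
        self._max_table_bytes = self._validate_positive(value, "max_table_bytes")

    @property
    def max_cell_length(self):
        return self._max_cell_length

    @max_cell_length.setter
    def max_cell_length(self, value):
        self._max_cell_length = self._validate_positive(value, "max_cell_length")
```

Because a rejected assignment raises before the attribute is written, a failed update leaves the previous value in place.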
@kosiew
Contributor Author

kosiew commented Apr 3, 2025

@timsaucer
I moved the display config to session context.
Can you review again?

@kosiew kosiew requested a review from timsaucer April 3, 2025 06:39
kosiew added 11 commits April 3, 2025 15:03
- Reduced the size of test data in the `data` fixture from 100 to 10 entries for efficiency.
- Added `normalize_uuid` function to standardize UUIDs in HTML representations for consistent testing.
- Modified the `test_display_config_in_init` to use a custom display configuration and updated assertions to compare normalized HTML outputs.
- Enhanced readability of assertions in `test_display_config_affects_repr` by formatting conditions.
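
For context on the `normalize_uuid` helper mentioned above: if the HTML output embeds randomly generated identifiers, two renderings of the same data differ textually unless those identifiers are replaced before comparison. The body below is an assumed implementation, not the one from this PR; only the function name comes from the commit message.

```python
import re

# Matches the canonical 8-4-4-4-12 hexadecimal UUID form.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

PLACEHOLDER_UUID = "00000000-0000-0000-0000-000000000000"


def normalize_uuid(html: str) -> str:
    """Replace every UUID in the HTML with a fixed placeholder for stable comparisons."""
    return UUID_PATTERN.sub(PLACEHOLDER_UUID, html)
```

A test can then compare `normalize_uuid(html_a)` with `normalize_uuid(html_b)` instead of the raw strings.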
@timsaucer
Contributor

This looks great. I browsed it this morning, but it's a bit long, so I will try to make some time tomorrow to do a more thorough review.

@kosiew
Contributor Author

kosiew commented Apr 24, 2025

@timsaucer

Should I move `max_table_bytes`, `min_table_rows`, and `max_table_rows_in_repr` to the Python `DataFrameHtmlFormatter` class as well?

Then we would not need a context display config.

@timsaucer
Contributor

Moving those over to your other work sounds like a great way to have one point of processing for all of these display options. I really love how all this work is coming together!

@timsaucer timsaucer mentioned this pull request Apr 27, 2025
@kosiew
Contributor Author

kosiew commented Apr 28, 2025

Closing this.

I am moving the configuration from Rust to Python in #1119.

@kosiew kosiew closed this Apr 28, 2025
Successfully merging this pull request may close these issues.

Improve html table rendering formatting
2 participants