Improve collection during repr and repr_html #1036
Conversation
LGTM! Very cool. Thanks for adding this.
src/dataframe.rs (Outdated)
// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc, batch| {
//     (acc.0 + batch.get_array_memory_size(), acc.1 + batch.num_rows())
// });
nit: remove commented-out code
Force-pushed from 323b5a0 to 161e38e.
I am testing my updated consolidated code now. When running a 10 GB scale factor TPC-H query, I get comparable times for both q1 and q2, which take 10 s and 53 s respectively on my M4 Pro at that scale. I will next test a 1 GB scale factor and then the tiny batches that were discussed in #1015.

One metric is comparing the df.show() times. For q2, for example:

df.show(): 53.881720781326294 s

When dropping down to a 1 GB data set, df.show() took 0.8244500160217285 s. The same 1 GB against main, df.show() took 0.8473942279815674 s.

Finally, for the tiny dataset (increased to 3 record batches so we do get multiple processing steps):

Average runtime over 100 runs: 0.001016 seconds (this branch)

And lastly, to verify it also resolves #1014: (screenshot)
def test_dataframe_repr_html(df) -> None:
    output = df._repr_html_()
Would it be a good idea to test for other df fixtures too? In addition, maybe add an empty_df fixture for tests too, e.g.:
import pyarrow as pa
import pytest
from datafusion import SessionContext


@pytest.fixture
def empty_df():
    ctx = SessionContext()
    # Create an empty RecordBatch with the same schema as df
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array([], type=pa.int64()),
            pa.array([], type=pa.int64()),
            pa.array([], type=pa.int64()),
        ],
        names=["a", "b", "c"],
    )
    return ctx.from_arrow(batch)
@pytest.mark.parametrize(
    "dataframe_fixture",
    ["empty_df", "df", "nested_df", "struct_df", "partitioned_df", "aggregate_df"],
)
def test_dataframe_repr_html(request, dataframe_fixture) -> None:
    df = request.getfixturevalue(dataframe_fixture)
    output = df._repr_html_()
fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
    let (batches, has_more) = wait_for_future(
        py,
        collect_record_batches_to_display(
            self.df.as_ref().clone(),
            MIN_TABLE_ROWS_TO_DISPLAY,
            usize::MAX,
        ),
    )?;
Extracting some variables into helper functions could make this more readable and easier to maintain, e.g.:
fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
    let (batches, has_more) = wait_for_future(
        py,
        collect_record_batches_to_display(
            self.df.as_ref().clone(),
            MIN_TABLE_ROWS_TO_DISPLAY,
            usize::MAX,
        ),
    )?;

    if batches.is_empty() {
        // This should not be reached, but do it for safety since we index into the vector below
        return Ok("No data to display".to_string());
    }

    let table_uuid = uuid::Uuid::new_v4().to_string();
    let schema = batches[0].schema();

    // Get table formatters for displaying cell values
    let batch_formatters = get_batch_formatters(&batches)?;
    let rows_per_batch = batches.iter().map(|batch| batch.num_rows());

    // Generate HTML components
    let mut html_str = generate_html_table_header(&schema);
    html_str.push_str(&generate_table_rows(
        &batch_formatters,
        rows_per_batch,
        &table_uuid,
    )?);
    html_str.push_str("</tbody></table></div>\n");
    html_str.push_str(&generate_javascript());

    if has_more {
        html_str.push_str("Data truncated due to size.");
    }

    Ok(html_str)
}
Added to follow-on issue #1078
src/dataframe.rs (Outdated)
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
In my proposed PR #1015, I use execute_stream_partitioned instead. execute_stream will append a CoalescePartitionsExec to merge partitions into a single partition (code: https://github.com/apache/datafusion/blob/74aeb91fd94109d05178555d83e812e6e0712573/datafusion/physical-plan/src/execution_plan.rs#L887C1-L889C1). This will load unnecessary partitions.
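For context, here is a minimal sketch of what consuming per-partition streams could look like, using the datafusion crate's DataFrame::execute_stream_partitioned; the function name and the max_rows cutoff are hypothetical, for illustration only:

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::prelude::DataFrame;
use futures::StreamExt;

// Sketch: pull batches partition by partition, stopping early once we
// have enough rows, so remaining partitions are never fully drained.
async fn collect_for_display(
    df: DataFrame,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut batches = Vec::new();
    let mut rows_so_far = 0;
    // One stream per partition; no CoalescePartitionsExec is inserted.
    for mut partition in df.execute_stream_partitioned().await? {
        while let Some(batch) = partition.next().await {
            let batch = batch?;
            rows_so_far += batch.num_rows();
            batches.push(batch);
            if rows_so_far >= max_rows {
                // More data may remain; report truncation to the caller.
                return Ok((batches, true));
            }
        }
    }
    Ok((batches, false))
}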
I'll switch to your approach.
let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;
let total_rows = rows_in_rb + rows_so_far;

let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize;
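As a worked example of this math (the numbers are hypothetical): with the 2 MB display budget and roughly 6 MB collected so far, the ratio is about 0.33, so about a third of the rows are kept.

fn main() {
    const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB budget
    let size_estimate_so_far: usize = 6 * 1024 * 1024; // ~6 MB collected so far
    let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;

    let total_rows = 3_000_usize; // rows accumulated when the budget was hit
    let reduced_row_num = (total_rows as f32 * ratio).round() as usize;
    println!("ratio = {ratio:.3}, rows kept = {reduced_row_num}"); // ratio = 0.333, rows kept = 1000
}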
This estimate is not very accurate if some rows are skewed in size.
Yes, and the data size is an estimate as well. The point is to get a general ballpark, not necessarily an exact measure. It should still indicate that the data have been truncated.
@@ -70,6 +72,9 @@ impl PyTableProvider {
        PyTable::new(table_provider)
    }
}

const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB
How about making this configurable? 2 MB can still mean lots of rows, as the upper bound is usize::MAX.
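One possible shape for that, sketched with a hypothetical environment variable as the configuration source (neither the variable name nor the mechanism is part of this PR):

// Sketch only: read an optional override, falling back to the 2 MB default.
fn max_table_bytes_to_display() -> usize {
    std::env::var("DATAFUSION_HTML_MAX_TABLE_BYTES")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(2 * 1024 * 1024)
}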
How about I open an issue to enhance this to be configurable, as well as the follow-on part about disabling the styling? I'd like to get this in so we fix explain and add some useful functionality now, and then we can get these things tightened up in the next iteration.
Added to issue #1078
    }
</style>

<div style=\"width: 100%; max-width: 1000px; max-height: 300px; overflow: auto; border: 1px solid #ccc;\">
I don't feel so positive about adding hardcoded styles, especially absolute width/margin. If Jupyter modifies its style/layout, or users apply customization based on the Jupyter UI, there might be incompatibility. At the least, there should be a switch to turn it off.
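A minimal sketch of such a switch, assuming a hypothetical enable_styles flag threaded through to the HTML generation (the flag and the CSS shown are illustrative):

// Sketch only: emit the hardcoded styles only when the flag is set.
fn style_block(enable_styles: bool) -> String {
    if enable_styles {
        "<style>.expandable-container { max-height: 300px; }</style>".to_string()
    } else {
        // Let Jupyter (or a user theme) style the table instead.
        String::new()
    }
}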
Added to issue #1078
…he table scrollable and displaying the first record batch up to 2MB
…click on a button to toggle showing more or less
…2MB limit, so switch over to collecting until we run out or use up the size
…ist and only check the table contents
Force-pushed from d034685 to 2882050.
Which issue does this PR close?
None.
Rationale for this change
The notebook rendering of DataFrames is very useful, but it can be enhanced. This PR adds quality-of-life improvements such as a `...` button to allow expanding a cell so you can view it in its entirety.

What changes are included in this PR?
This PR adds a feature that collects record batches and uses their size estimate to gather up to 2 MB worth of data. This is typically enough for most use cases to review the data, but it is a constant we can update. We determine how many rows to show to the user: either 2 MB worth (a record batch will easily have more than this) or at least 20 rows (also up for changing). We then render this as an HTML table.
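A rough sketch of that stopping rule, reusing the constant names that appear elsewhere in this PR; the helper function itself is hypothetical:

const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB
const MIN_TABLE_ROWS_TO_DISPLAY: usize = 20;

// Keep collecting while we are under the byte budget, but never stop
// before the minimum row count has been reached.
fn keep_collecting(bytes_so_far: usize, rows_so_far: usize) -> bool {
    rows_so_far < MIN_TABLE_ROWS_TO_DISPLAY || bytes_so_far < MAX_TABLE_BYTES_TO_DISPLAY
}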
In the rendering we check whether an individual cell contains more than 25 characters. If so, we show a 25-character snippet of the string representation of the data and a `...` button that triggers a JavaScript call to update which data are displayed in the cell.
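A sketch of that per-cell rule; the function name and the emitted markup are illustrative, not the PR's exact output:

const MAX_CELL_CHARS: usize = 25;

// Truncate long cell values and attach a toggle button; the real table
// wires the button to a JavaScript handler keyed by a table UUID.
fn format_cell(value: &str) -> String {
    if value.chars().count() <= MAX_CELL_CHARS {
        return value.to_string();
    }
    let snippet: String = value.chars().take(MAX_CELL_CHARS).collect();
    format!("{snippet}… <button>...</button>")
}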
Are there any user-facing changes?

Yes, but not to the API. Any user who uses Jupyter notebooks will experience these enhanced tables.

See the below screenshots for examples:


