
Improve collection during repr and repr_html #1036


Merged · 8 commits into apache:main · Mar 22, 2025

Conversation

@timsaucer (Contributor):

Which issue does this PR close?

None.

Rationale for this change

The notebook rendering of DataFrames is very useful, but it can be enhanced. This PR adds quality-of-life improvements such as:

  • The table is now scrollable both vertically and horizontally
  • Instead of collecting an arbitrary 10 rows, we collect up to 2 MB worth of data
  • Scalars that render to long strings (over 25 characters) are truncated, with a ... button that expands the cell so you can view the value in its entirety
  • When more data are available than are displayed, we tell the user the data are truncated
  • When no data are returned, we report that to the user

What changes are included in this PR?

This PR adds a feature to collect record batches and use their size estimate to gather up to 2 MB worth of data. That is typically enough to review the data for most use cases, and it is a constant we can update. We determine how many rows to show the user: either 2 MB worth (a single record batch can easily exceed this) or at least 20 rows (also open to change). We then render this as an HTML table.

In the rendering, we check whether an individual cell contains more than 25 characters. If so, we show a 25-character snippet of the string representation of the data and a ... button whose JavaScript call swaps the full value into the cell.
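For illustration, here is a minimal sketch of that collection logic, assuming DataFusion's DataFrame streaming API; collect_up_to_budget and the exact stopping rule are illustrative, not necessarily the merged implementation:

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::prelude::DataFrame;
use futures::StreamExt;

const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB

// Collect batches until we have at least `min_rows` rows and have crossed
// the byte budget, then peek at the stream once more so we can tell the
// user whether the displayed table is truncated.
async fn collect_up_to_budget(
    df: DataFrame,
    min_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
    let mut batches = Vec::new();
    let (mut bytes_so_far, mut rows_so_far) = (0usize, 0usize);
    let mut has_more = false;

    while let Some(batch) = stream.next().await {
        let batch = batch?;
        bytes_so_far += batch.get_array_memory_size();
        rows_so_far += batch.num_rows();
        batches.push(batch);
        if rows_so_far >= min_rows && bytes_so_far >= MAX_TABLE_BYTES_TO_DISPLAY {
            // Anything still pending in the stream means the table is truncated.
            has_more = stream.next().await.is_some();
            break;
        }
    }
    Ok((batches, has_more))
}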

Are there any user-facing changes?

Yes, but not to the API. Any user of Jupyter notebooks will see these enhanced tables.

See the below screenshots for examples:
(Screenshots: table-views-1, table-views-2, table-views-3)

@timsaucer self-assigned this on Mar 1, 2025
@kevinjqliu (Contributor) left a comment:

LGTM! Very cool. Thanks for adding this.

src/dataframe.rs Outdated
Comment on lines 183 to 185
// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc, batch| {
// (acc.0 + batch.get_array_memory_size(), acc.1 + batch.num_rows())
// });
Contributor:

nit: remove commented-out code

@timsaucer (Contributor, Author):

Still needs careful testing and performance evaluation, but I'm trying to pull the methods for this issue, #1041, and #1015 into a single PR.

@timsaucer (Contributor, Author):

I am testing my updated consolidated code now. When running a 10 GB scale factor TPC-H query, I get comparable times for q1 and q2, which take 10 s and 53 s respectively on my M4 Pro at that scale. I will next test at 1 GB scale factor and then the tiny batches that were discussed in #1015.

(Screenshot: render-with-limited-returns)

One metric is comparing df.show() with df.__repr__(). The former essentially calls the previous code; the latter is the updated call. I also tested against main to get comparable values.

For q2 for example:

df.show(): 53.881720781326294 s
df.__repr__(): 52.33351922035217 s

When dropping down to a 1GB data set

df.show() took: 0.8244500160217285 s
df.__repr__() took: 0.8161180019378662 s

The same 1GB against main

df.show() took: 0.8473942279815674 s
df.__repr__() took: 0.8100850582122803 s

Finally, for the tiny dataset (increased to 3 record batches so we do get multiple processing steps)

Average runtime over 100 runs: 0.001016 seconds (this branch)
Average runtime over 100 runs: 0.001011 seconds (main)

And lastly, to verify it also resolves #1014

(Screenshot)

@timsaucer changed the title from "Scrollable python notebook table rendering" to "Improve collection during repr and repr_html" on Mar 14, 2025
@timsaucer (Contributor, Author):

@konjac would you mind reviewing this? I pulled a portion of your code/idea into it so that this can supersede #1015

Comment on lines 1192 to 1247
def test_dataframe_repr_html(df) -> None:
output = df._repr_html_()
@kosiew (Contributor), Mar 19, 2025:

Would it be a good idea to test the other df fixtures too?

In addition, maybe add an empty_df fixture for the tests too,
e.g.:

# Assumed imports from the surrounding test module:
import pytest
import pyarrow as pa
from datafusion import SessionContext


@pytest.fixture
def empty_df():
    ctx = SessionContext()

    # Create an empty RecordBatch with the same schema as df
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array([], type=pa.int64()),
            pa.array([], type=pa.int64()),
            pa.array([], type=pa.int64()),
        ],
        names=["a", "b", "c"],
    )

    return ctx.from_arrow(batch)


@pytest.mark.parametrize(
    "dataframe_fixture",
    ["empty_df", "df", "nested_df", "struct_df", "partitioned_df", "aggregate_df"],
)
def test_dataframe_repr_html(request, dataframe_fixture) -> None:
    df = request.getfixturevalue(dataframe_fixture)
    output = df._repr_html_()

Comment on lines +139 to +147
fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
    let (batches, has_more) = wait_for_future(
        py,
        collect_record_batches_to_display(
            self.df.as_ref().clone(),
            MIN_TABLE_ROWS_TO_DISPLAY,
            usize::MAX,
        ),
    )?;
Contributor:

Extracting some variables into helper functions could make this more readable and easier to maintain, e.g.:

    fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
        let (batches, has_more) = wait_for_future(
            py,
            collect_record_batches_to_display(
                self.df.as_ref().clone(),
                MIN_TABLE_ROWS_TO_DISPLAY,
                usize::MAX,
            ),
        )?;
        if batches.is_empty() {
            // This should not be reached, but do it for safety since we index into the vector below
            return Ok("No data to display".to_string());
        }

        let table_uuid = uuid::Uuid::new_v4().to_string();
        let schema = batches[0].schema();
        
        // Get table formatters for displaying cell values
        let batch_formatters = get_batch_formatters(&batches)?;
        let rows_per_batch = batches.iter().map(|batch| batch.num_rows());

        // Generate HTML components
        let mut html_str = generate_html_table_header(&schema);
        html_str.push_str(&generate_table_rows(
            &batch_formatters, 
            rows_per_batch, 
            &table_uuid
        )?);
        html_str.push_str("</tbody></table></div>\n");
        html_str.push_str(&generate_javascript());

        if has_more {
            html_str.push_str("Data truncated due to size.");
        }

        Ok(html_str)
    }

@timsaucer (Contributor, Author):

Added to follow-on issue #1078.

src/dataframe.rs Outdated
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
@konjac, Mar 21, 2025:

In my proposed PR #1015, I use execute_stream_partitioned instead.

execute_stream appends a CoalescePartitionsExec to merge all partitions into a single partition (code: https://github.com/apache/datafusion/blob/74aeb91fd94109d05178555d83e812e6e0712573/datafusion/physical-plan/src/execution_plan.rs#L887C1-L889C1). This loads unnecessary partitions.
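A rough sketch of the difference: with execute_stream_partitioned, each partition gets its own stream, so collection can return before untouched partitions are ever polled (take_rows_partitioned and the early-exit rule here are illustrative, not the code from #1015):

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::prelude::DataFrame;
use futures::StreamExt;

// Pull batches partition by partition, stopping once `row_budget` rows have
// been gathered; partitions after the early return are never polled.
async fn take_rows_partitioned(
    df: DataFrame,
    row_budget: usize,
) -> Result<Vec<RecordBatch>, DataFusionError> {
    let mut collected = Vec::new();
    let mut rows = 0;
    for mut partition in df.execute_stream_partitioned().await? {
        while let Some(batch) = partition.next().await {
            let batch = batch?;
            rows += batch.num_rows();
            collected.push(batch);
            if rows >= row_budget {
                return Ok(collected);
            }
        }
    }
    Ok(collected)
}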

@timsaucer (Contributor, Author):

I'll switch to your approach.

let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;
let total_rows = rows_in_rb + rows_so_far;

let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize;

This estimate is not very accurate if row sizes are skewed.

@timsaucer (Contributor, Author):

Yes. And the data size is an estimate as well. The point is to get a general ballpark, not an exact measure. It should still indicate that the data have been truncated.
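For context, a sketch of what that proportional trim can look like under the uniform-row-size assumption this thread is pointing at (trim_last_batch is an illustrative name, not the merged helper):

use datafusion::arrow::record_batch::RecordBatch;

// Shrink the final batch so the estimated display size lands near the byte
// budget. Assumes roughly uniform row sizes, which is exactly why skewed
// rows make this a ballpark rather than an exact measure.
fn trim_last_batch(
    batch: &RecordBatch,
    rows_so_far: usize,
    size_estimate_so_far: usize,
    byte_budget: usize,
    min_rows: usize,
) -> RecordBatch {
    if batch.num_rows() == 0 {
        return batch.clone();
    }
    let total_rows = rows_so_far + batch.num_rows();
    let ratio = byte_budget as f32 / size_estimate_so_far as f32;
    let target_total = ((total_rows as f32 * ratio).round() as usize).max(min_rows);
    // Keep at least one row from this batch and never more than it holds.
    let keep = target_total
        .saturating_sub(rows_so_far)
        .clamp(1, batch.num_rows());
    batch.slice(0, keep)
}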

@@ -70,6 +72,9 @@ impl PyTableProvider {
PyTable::new(table_provider)
}
}
const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB

How about making this configurable? 2 MB can still mean a lot of rows, since the row upper bound is usize::MAX.

@timsaucer (Contributor, Author):

How about I open an issue to make this configurable, along with the follow-on item about disabling the styling? I'd like to get this in so we fix explain and add some useful functionality now; we can tighten these things up in the next iteration.
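If the budget does become configurable, one lightweight shape for it is an environment override with the 2 MB default (entirely hypothetical; the DATAFUSION_HTML_MAX_TABLE_BYTES variable name is made up for illustration):

use std::env;

// Hypothetical: read the display byte budget from the environment,
// falling back to the 2 MB constant discussed above.
fn table_bytes_to_display() -> usize {
    env::var("DATAFUSION_HTML_MAX_TABLE_BYTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(2 * 1024 * 1024)
}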

@timsaucer (Contributor, Author):

Added to issue #1078

}
</style>

<div style=\"width: 100%; max-width: 1000px; max-height: 300px; overflow: auto; border: 1px solid #ccc;\">

I don't feel positive about adding hardcoded styles, especially absolute width/margin. If Jupyter modifies its style/layout, or users apply customization on top of the Jupyter UI, there could be incompatibilities. At the least, there should be a switch to turn it off.

@timsaucer (Contributor, Author):

Added to issue #1078

@timsaucer force-pushed the feat/scrollable_html_render branch from d034685 to 2882050 on March 22, 2025, 13:04
@timsaucer merged commit 42982da into apache:main on Mar 22, 2025
17 checks passed
@timsaucer deleted the feat/scrollable_html_render branch on March 22, 2025, 14:15