Improve collection during repr and repr_html #1036
Conversation
LGTM! Very cool. Thanks for adding this.
src/dataframe.rs (Outdated)
// let (total_memory, total_rows) = batches.iter().fold((0, 0), |acc, batch| {
//     (acc.0 + batch.get_array_memory_size(), acc.1 + batch.num_rows())
// });
nit: remove commented-out code
Force-pushed from 323b5a0 to 161e38e.
I am testing my updated consolidated code now. When running a 10 GB scale factor TPC-H query, I get comparable times for both q1 and q2, which take 10 s and 53 s respectively on my M4 Pro at that scale. I will next test a 1 GB scale factor and then the tiny batches that were discussed in #1015.

One metric is comparing the df.show() times. For q2, for example:

df.show(): 53.881720781326294 s

When dropping down to a 1 GB data set, df.show() took 0.8244500160217285 s. The same 1 GB against main, df.show() took 0.8473942279815674 s.

Finally, for the tiny dataset (increased to 3 record batches so we do get multiple processing steps):

Average runtime over 100 runs: 0.001016 seconds (this branch)

And lastly, to verify it also resolves #1014: (screenshot)
def test_dataframe_repr_html(df) -> None:
    output = df._repr_html_()
Would it be a good idea to test for other df fixtures too? In addition, maybe add an empty_df fixture for tests too, e.g.:
import pyarrow as pa
import pytest
from datafusion import SessionContext


@pytest.fixture
def empty_df():
    ctx = SessionContext()
    # Create an empty RecordBatch with the same schema as df
    batch = pa.RecordBatch.from_arrays(
        [
            pa.array([], type=pa.int64()),
            pa.array([], type=pa.int64()),
            pa.array([], type=pa.int64()),
        ],
        names=["a", "b", "c"],
    )
    return ctx.from_arrow(batch)
@pytest.mark.parametrize(
    "dataframe_fixture",
    ["empty_df", "df", "nested_df", "struct_df", "partitioned_df", "aggregate_df"],
)
def test_dataframe_repr_html(request, dataframe_fixture) -> None:
    df = request.getfixturevalue(dataframe_fixture)
    output = df._repr_html_()
fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
    let (batches, has_more) = wait_for_future(
        py,
        collect_record_batches_to_display(
            self.df.as_ref().clone(),
            MIN_TABLE_ROWS_TO_DISPLAY,
            usize::MAX,
        ),
    )?;
Extracting some variables into helper functions could make this more readable and easier to maintain, e.g.:
fn _repr_html_(&self, py: Python) -> PyDataFusionResult<String> {
    let (batches, has_more) = wait_for_future(
        py,
        collect_record_batches_to_display(
            self.df.as_ref().clone(),
            MIN_TABLE_ROWS_TO_DISPLAY,
            usize::MAX,
        ),
    )?;

    if batches.is_empty() {
        // This should not be reached, but do it for safety since we index into the vector below
        return Ok("No data to display".to_string());
    }

    let table_uuid = uuid::Uuid::new_v4().to_string();
    let schema = batches[0].schema();

    // Get table formatters for displaying cell values
    let batch_formatters = get_batch_formatters(&batches)?;
    let rows_per_batch = batches.iter().map(|batch| batch.num_rows());

    // Generate HTML components
    let mut html_str = generate_html_table_header(&schema);
    html_str.push_str(&generate_table_rows(
        &batch_formatters,
        rows_per_batch,
        &table_uuid,
    )?);
    html_str.push_str("</tbody></table></div>\n");
    html_str.push_str(&generate_javascript());

    if has_more {
        html_str.push_str("Data truncated due to size.");
    }

    Ok(html_str)
}
Added to follow-on issue #1078
src/dataframe.rs (Outdated)
    min_rows: usize,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut stream = df.execute_stream().await?;
In my proposed PR #1015, I use execute_stream_partitioned instead. execute_stream will append a CoalescePartitionsExec to merge partitions into a single partition (code: https://github.com/apache/datafusion/blob/74aeb91fd94109d05178555d83e812e6e0712573/datafusion/physical-plan/src/execution_plan.rs#L887C1-L889C1). This will load unnecessary partitions.
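For context, here is a minimal sketch of what consuming per-partition streams could look like, using the datafusion crate's DataFrame::execute_stream_partitioned; the function name and the max_rows cutoff are hypothetical, for illustration only:

use datafusion::arrow::record_batch::RecordBatch;
use datafusion::error::DataFusionError;
use datafusion::prelude::DataFrame;
use futures::StreamExt;

// Sketch: pull batches partition by partition, stopping early once we
// have enough rows, so remaining partitions are never fully drained.
async fn collect_for_display(
    df: DataFrame,
    max_rows: usize,
) -> Result<(Vec<RecordBatch>, bool), DataFusionError> {
    let mut batches = Vec::new();
    let mut rows_so_far = 0;
    // One stream per partition; no CoalescePartitionsExec is inserted.
    for mut partition in df.execute_stream_partitioned().await? {
        while let Some(batch) = partition.next().await {
            let batch = batch?;
            rows_so_far += batch.num_rows();
            batches.push(batch);
            if rows_so_far >= max_rows {
                // More data may remain; report truncation to the caller.
                return Ok((batches, true));
            }
        }
    }
    Ok((batches, false))
}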
I'll switch to your approach.
let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;
let total_rows = rows_in_rb + rows_so_far;

let mut reduced_row_num = (total_rows as f32 * ratio).round() as usize;
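As a worked example of this math (the numbers are hypothetical): with the 2 MB display budget and roughly 6 MB collected so far, the ratio is about 0.33, so about a third of the rows are kept.

fn main() {
    const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB budget
    let size_estimate_so_far: usize = 6 * 1024 * 1024; // ~6 MB collected so far
    let ratio = MAX_TABLE_BYTES_TO_DISPLAY as f32 / size_estimate_so_far as f32;

    let total_rows = 3_000_usize; // rows accumulated when the budget was hit
    let reduced_row_num = (total_rows as f32 * ratio).round() as usize;
    println!("ratio = {ratio:.3}, rows kept = {reduced_row_num}"); // ratio = 0.333, rows kept = 1000
}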
This estimate is not very accurate if some rows are skewed in size.
Yes, and the data size is an estimate as well. The point is to get a general ballpark, not necessarily an exact measure. It should still indicate that the data have been truncated.
@@ -70,6 +72,9 @@ impl PyTableProvider {
        PyTable::new(table_provider)
    }
}

const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB
How about making this configurable? 2 MB can still mean lots of rows, as the upper bound is usize::MAX.
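One possible shape for that, sketched with a hypothetical environment variable as the configuration source (neither the variable name nor the mechanism is part of this PR):

// Sketch only: read an optional override, falling back to the 2 MB default.
fn max_table_bytes_to_display() -> usize {
    std::env::var("DATAFUSION_HTML_MAX_TABLE_BYTES")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(2 * 1024 * 1024)
}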
How about I open an issue to enhance this to be configurable, as well as the follow-on part about disabling the styling? I'd like to get this in so we fix explain and add some useful functionality now, and then we can get these things tightened up in the next iteration.
Added to issue #1078
    }
</style>

<div style=\"width: 100%; max-width: 1000px; max-height: 300px; overflow: auto; border: 1px solid #ccc;\">
I don't feel so positive about adding hardcoded styles, especially absolute width/margin. If Jupyter modifies its style/layout, or users apply customization based on the Jupyter UI, there might be incompatibility. At the least, there should be a switch to turn it off.
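A minimal sketch of such a switch, assuming a hypothetical enable_styles flag threaded through to the HTML generation (the flag and the CSS shown are illustrative):

// Sketch only: emit the hardcoded styles only when the flag is set.
fn style_block(enable_styles: bool) -> String {
    if enable_styles {
        "<style>.expandable-container { max-height: 300px; }</style>".to_string()
    } else {
        // Let Jupyter (or a user theme) style the table instead.
        String::new()
    }
}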
Added to issue #1078
…he table scrollable and displaying the first record batch up to 2MB
…click on a button to toggle showing more or less
…2MB limit, so switch over to collecting until we run out or use up the size
…ist and only check the table contents
Force-pushed from d034685 to 2882050.
Which issue does this PR close?
None.
Rationale for this change
The notebook rendering of DataFrames is very useful, but it can be enhanced. This PR adds quality-of-life improvements such as a `...` button to allow expanding a cell so you can view it in its entirety.

What changes are included in this PR?
This PR adds a feature that collects record batches and uses their size estimate to gather up to 2 MB worth of data. This is typically enough for most use cases to review the data, but it is a constant we can update. We determine how many rows to show to the user: either 2 MB worth (a record batch will easily have more than this) or at least 20 rows (also up for changing). We then render this as an HTML table.
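A rough sketch of that stopping rule, reusing the constant names that appear elsewhere in this PR; the helper function itself is hypothetical:

const MAX_TABLE_BYTES_TO_DISPLAY: usize = 2 * 1024 * 1024; // 2 MB
const MIN_TABLE_ROWS_TO_DISPLAY: usize = 20;

// Keep collecting while we are under the byte budget, but never stop
// before the minimum row count has been reached.
fn keep_collecting(bytes_so_far: usize, rows_so_far: usize) -> bool {
    rows_so_far < MIN_TABLE_ROWS_TO_DISPLAY || bytes_so_far < MAX_TABLE_BYTES_TO_DISPLAY
}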
In the rendering we check whether an individual cell contains more than 25 characters. If so, we show a 25-character snippet of the string representation of the data and a `...` button that triggers a JavaScript call to update which data are displayed in the cell.
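A sketch of that per-cell rule; the function name and the emitted markup are illustrative, not the PR's exact output:

const MAX_CELL_CHARS: usize = 25;

// Truncate long cell values and attach a toggle button; the real table
// wires the button to a JavaScript handler keyed by a table UUID.
fn format_cell(value: &str) -> String {
    if value.chars().count() <= MAX_CELL_CHARS {
        return value.to_string();
    }
    let snippet: String = value.chars().take(MAX_CELL_CHARS).collect();
    format!("{snippet}… <button>...</button>")
}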
Are there any user-facing changes?

Yes, but not to the API. Any user who uses Jupyter notebooks will experience these enhanced tables.

See the below screenshots for examples:


