feat: df.join lsuffix and rsuffix support #1857
  if how == "cross":
      if on is not None:
          raise ValueError("'on' is not supported for cross join.")
      result_block = left._block.merge(
          right._block,
          left_join_ids=[],
          right_join_ids=[],
-         suffixes=("", ""),
+         suffixes=(lsuffix, rsuffix),
          how="cross",
          sort=True,
      )
      return DataFrame(result_block)

  # Join left columns with right index
  if on is not None:
Nit: This `if` block is getting pretty long. Might be time for a helper function.
Added _join_on_key function.
tests/system/small/test_dataframe.py
Outdated
    ["string_col", "int64_col", "int64_too"]
].rename(columns={"int64_too": "int64_col"})
pd_result = pd_df_a.join(pd_df_b, how=how, lsuffix="_l", rsuffix="_r")
print(pd_result)
Remove leftover `print()` statements.
PS. Adding `--pdb` to your `pytest` command line arguments makes dropping into a debugger to inspect variables really easy. https://docs.pytest.org/en/stable/how-to/failures.html#dropping-to-pdb-on-failures
Thanks!
if how == "cross":
    return
I think it'd be worth adding a test that `ValueError` is not raised for this condition with a cross join.
Cross join actually raises another error; match added.
    f"bigframes_left_col_name_{i}" if col_name != on else on_col_name
    for i, col_name in enumerate(left_col_original_names)
]
left.columns = pandas.Index(left_col_temp_names)
This seems dangerous. We haven't made a copy of `self`, so I'm uncomfortable with mutating it. If we must do this, then please either:
1. make a copy of `self` first
2. or put a `finally` block that resets the names back to the original in case anything goes wrong.
I prefer (1) since it's less likely to have problems if we're in a multi-threaded environment.
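A minimal sketch of option (1), using plain pandas as a stand-in for the bigframes `DataFrame`. The helper name `with_temp_column_names` is hypothetical and not part of this PR; only the `bigframes_left_col_name_{i}` naming pattern comes from the diff.

```python
import pandas as pd


def with_temp_column_names(left: pd.DataFrame, on: str) -> pd.DataFrame:
    """Hypothetical helper: rename columns on a copy, never on the caller's frame."""
    renamed = left.copy()  # option (1): all renaming happens on this copy
    renamed.columns = pd.Index(
        [
            f"bigframes_left_col_name_{i}" if col_name != on else on
            for i, col_name in enumerate(left.columns)
        ]
    )
    return renamed


original = pd.DataFrame({"key": [1, 2], "value": ["a", "b"]})
renamed = with_temp_column_names(original, on="key")

# The caller's frame keeps its original column names.
assert list(original.columns) == ["key", "value"]
```

Because nothing is ever written back to the input, there is no window in which another thread could observe the temporary names, which is the concern a `finally`-based reset cannot fully address.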
    f"bigframes_left_idx_name_{i}" for i in range(len(left_idx_original_names))
]
if left._has_index:
    left.index.names = left_idx_names_in_cols
Same here. Mutating the index is dangerous. Can we avoid this?
    f"bigframes_right_col_name_{i}"
    for i in range(len(right_col_original_names))
]
right.columns = pandas.Index(right_col_temp_names)
Same here.
    right_columns,
    lsuffix: str = "",
    rsuffix: str = "",
    extra_col: typing.Optional[str] = None,
Please add a docstring explaining this `extra_col` parameter and when it is intended to be used.
        final_col_names.append(f"{col_name}{rsuffix}")
    else:
        final_col_names.append(col_name)
self.columns = pandas.Index(final_col_names)
I think we should only be modifying `self` if we're doing an `inplace` operation, right? Why is `self` getting changed? Can we avoid this?
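One way to avoid touching `self` is to compute the final names in a pure function and apply them only to the newly built result. The sketch below is an assumption about how that could look, not the PR's actual code: `suffixed_column_names` is a hypothetical name, and the rule it applies (suffix only the names that appear on both sides) is the pandas `join` convention for `lsuffix` and `rsuffix`.

```python
from typing import List, Sequence


def suffixed_column_names(
    left_columns: Sequence[str],
    right_columns: Sequence[str],
    lsuffix: str = "",
    rsuffix: str = "",
) -> List[str]:
    """Hypothetical pure helper: returns new names and mutates nothing."""
    overlap = set(left_columns) & set(right_columns)
    final_col_names = []
    for col_name in left_columns:
        # Only names present on both sides receive the left suffix.
        final_col_names.append(f"{col_name}{lsuffix}" if col_name in overlap else col_name)
    for col_name in right_columns:
        # Likewise, only overlapping names receive the right suffix.
        final_col_names.append(f"{col_name}{rsuffix}" if col_name in overlap else col_name)
    return final_col_names


names = suffixed_column_names(
    ["string_col", "int64_col"],
    ["int64_col", "float64_col"],
    lsuffix="_l",
    rsuffix="_r",
)
print(names)
```

The returned list can then be assigned to the columns of the result frame, so neither input frame is changed.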
bf_df_b = scalars_df_index.dropna()[
    ["string_col", "int64_col", "int64_too"]
].rename(columns={"int64_too": "int64_col"})
bf_result = bf_df_a.join(bf_df_b, how=how, lsuffix="_l", rsuffix="_r").to_pandas()
Can we add some checks that `bf_df_a`'s column names and index names didn't get modified?
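A rough shape for such a check, shown with plain pandas frames so it runs on its own. The frame contents and the `rowindex` index name are illustrative assumptions; in the real test the same snapshot-and-compare pattern would wrap the `bf_df_a.join(...)` call.

```python
import pandas as pd

df_a = pd.DataFrame(
    {"string_col": ["a", "b"], "int64_col": [1, 2]},
    index=pd.Index([10, 20], name="rowindex"),
)
df_b = pd.DataFrame(
    {"int64_col": [3, 4]},
    index=pd.Index([10, 20], name="rowindex"),
)

# Snapshot the names before the join.
columns_before = list(df_a.columns)
index_names_before = list(df_a.index.names)

result = df_a.join(df_b, how="left", lsuffix="_l", rsuffix="_r")

# The join must not rename anything on its left input.
assert list(df_a.columns) == columns_before
assert list(df_a.index.names) == index_names_before
```

Taking the snapshot as plain lists, rather than holding a reference to `df_a.columns`, keeps the comparison meaningful even if the implementation mutates the index object in place.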