Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNOW-1897441: Fix missing row position sort in DataFrame.describe #2950

Merged
merged 2 commits into from
Jan 28, 2025

Conversation

sfc-gh-joshi
Copy link
Contributor

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1897441

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

When calling describe on a DataFrame with object columns, pandas will report a top column identifying the value that appears most frequently. If the top two values share the same frequency, pandas documentation indicates that it actually does not provide any stability guarantees:

If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.

Tests involving this behavior are currently failing on QA6, where it appears that the order of results returned by a GROUP BY/COUNT query has changed. This PR adds an additional sort on the row position column to ensure that the object value that appears first is always chosen first; this may not always agree with pandas (though pandas does this in all of our current tests), but at least keeps results the same between prod and qa6.

I ran tests/integ/modin/frame/test_describe.py with a qa6 account to verify everything passes.

@sfc-gh-joshi sfc-gh-joshi requested a review from a team as a code owner January 28, 2025 19:51
@sfc-gh-joshi sfc-gh-joshi added NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs labels Jan 28, 2025
Copy link
Contributor

@sfc-gh-lmukhopadhyay sfc-gh-lmukhopadhyay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks Jonathan!

Copy link
Contributor

@sfc-gh-helmeleegy sfc-gh-helmeleegy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks Johnathan.

@sfc-gh-joshi sfc-gh-joshi enabled auto-merge (squash) January 28, 2025 22:35
@sfc-gh-joshi sfc-gh-joshi merged commit 8f0f191 into main Jan 28, 2025
38 of 39 checks passed
@sfc-gh-joshi sfc-gh-joshi deleted the joshi-SNOW-1897441-qa6-describe branch January 28, 2025 23:32
@github-actions github-actions bot locked and limited conversation to collaborators Jan 28, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md NO-PANDAS-CHANGEDOC-UPDATES This PR does not update Snowpark pandas docs snowpark-pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants