@samukweku
Collaborator

@samukweku samukweku commented Aug 4, 2025

PR Description

Please describe the changes proposed in the pull request:

  • add cythonised code, mainly for loops, which improves performance significantly
  • take advantage of Cython 3.0's pure Python mode: the sources stay ordinary .py files while still getting the benefits of a lower-level language (in this case C)
  • add support for aggregations within conditional join
  • fairly large PR; the tests remain the same, and most of the diff is refactoring plus the new Cython functions
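
For readers unfamiliar with pure Python mode, here is a minimal sketch of the kind of typed loop it allows. The function name `count_departures_in_window` and its data are hypothetical and are not taken from this PR. The guarded import is only there so the sketch also runs where Cython is not installed; a real module would presumably just `import cython`.

```python
try:
    import cython  # ships with the Cython package; importable without compiling
except ImportError:  # guard only so this sketch also runs where Cython is absent
    cython = None


def count_departures_in_window(takeoffs, start, end):
    """Count takeoff times inside the closed interval [start, end]."""
    # Local annotations are type hints for the Cython compiler. The plain
    # Python interpreter does not evaluate them, so the same file also runs
    # uncompiled, only slower.
    i: cython.Py_ssize_t
    total: cython.Py_ssize_t = 0
    for i in range(len(takeoffs)):
        if start <= takeoffs[i] <= end:
            total += 1
    return total


# Toy data: times expressed as plain integers (hypothetical values).
print(count_departures_in_window([100, 300, 500], 245, 380))  # prints 1
```

The same source file can be imported as ordinary Python or compiled by Cython; only the compiled version benefits from the typed loop counter.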

This PR resolves #1490.

Example (as always with benchmarks/tests, take with a pinch of salt; YMMV):

import pandas as pd
import janitor  # pyjanitor; registers factorize_columns and conditional_join on DataFrame

url = "https://raw.githubusercontent.com/samukweku/data-wrangling-blog/master/notebooks/Data_files/flights.csv"
flights = pd.read_csv(url, sep='\t', names=['orig', 'dest', 'orig_time', 'dest_time'], parse_dates=['orig_time', 'dest_time'])
flights = flights.factorize_columns(['orig','dest']).iloc[:, 2:]
flights.columns = [ent.split("_")[0] if ent.endswith('enc') else ent for ent in flights]
flights.columns = ['takeoff','landing','orig','dest']
flights = flights.assign(start=flights.landing+pd.Timedelta(minutes=45), end=flights.landing+pd.Timedelta(hours=3))
flights.head()

               takeoff              landing  orig  dest                start                  end
0  2021-11-27 07:15:00  2021-11-27 08:55:00     0     0  2021-11-27 09:40:00  2021-11-27 11:55:00
1  2021-11-27 20:05:00  2021-11-28 00:50:00     1     1  2021-11-28 01:35:00  2021-11-28 03:50:00
2  2021-11-27 21:00:00  2021-11-27 21:35:00     2     2  2021-11-27 22:20:00  2021-11-28 00:35:00
3  2021-11-27 21:15:00  2021-11-27 22:25:00     0     2  2021-11-27 23:10:00  2021-11-28 01:25:00
4  2021-11-26 11:40:00  2021-11-26 14:45:00     3     3  2021-11-26 15:30:00  2021-11-26 17:45:00

# classic Pandas merge and filter
%%timeit 
outt = (flights
        .merge(flights, left_on='dest', right_on='orig')
        .loc[lambda f: f.takeoff_y.between(f.start_x, f.end_x) & (f.orig_x != f.orig_y), 
             ['start_x', 'takeoff_y', 'start_y', 'dest_x', 'orig_y']
             ]
        )
9.85 s ± 69.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# dev, with use_numba=False
%%timeit
outer = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        use_numba=False)
)
2.03 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
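
To make the join conditions concrete, here is a brute-force sketch of what the call above computes, read as `(left_column, right_column, operator)` tuples. The helper `naive_conditional_join` and the toy rows are hypothetical; this is not how pyjanitor implements `conditional_join`, only a nested-loop restatement of the predicate, with times written as plain integers.

```python
import operator

# Map the operator strings used in the condition tuples to functions.
OPS = {">=": operator.ge, "<=": operator.le, "!=": operator.ne, "==": operator.eq}


def naive_conditional_join(left, right, *conditions):
    """Return (left_row, right_row) pairs for which every condition holds."""
    matches = []
    for l_row in left:
        for r_row in right:
            if all(OPS[op](l_row[l_col], r_row[r_col])
                   for l_col, r_col, op in conditions):
                matches.append((l_row, r_row))
    return matches


# Toy data (hypothetical): three flights, times as integer minutes.
toy_flights = [
    {"orig": 0, "dest": 1, "takeoff": 100, "start": 245, "end": 380},
    {"orig": 1, "dest": 2, "takeoff": 300, "start": 445, "end": 580},
    {"orig": 1, "dest": 0, "takeoff": 500, "start": 645, "end": 780},
]

pairs = naive_conditional_join(
    toy_flights, toy_flights,
    ("end", "takeoff", ">="),
    ("start", "takeoff", "<="),
    ("orig", "orig", "!="),
    ("dest", "orig", "=="),
)
print(len(pairs))  # prints 1: the first flight connects to the second
```

The nested loop visits every pair of rows, which is why loop-heavy code of this kind is a natural candidate for compilation or for smarter search strategies.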



# this PR, with use_numba=False
%%timeit
outerr = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        use_numba=False)
)
29.9 ms ± 470 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)





# dev, with force=True
%%timeit
out = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        force=True,
        use_numba=False)
)
255 ms ± 3.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR, with force=True
%%timeit
outt = (flights
    .conditional_join(
        flights,
        ('end', 'takeoff', '>='),
        ('start', 'takeoff', '<='),
        ('orig', 'orig', '!='),
        ('dest', 'orig', '=='),
        df_columns=['start', 'end', 'dest'],
        right_columns=['takeoff', 'orig'],
        force=True,
        use_numba=False)
)
83.7 ms ± 552 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
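
For a quick sense of scale, the mean timings reported above can be turned into speedup ratios. This is only arithmetic on the numbers quoted in this description (a single machine, a single dataset), and the dictionary labels are just shorthand for the benchmark comments.

```python
# Mean timings from the %%timeit runs above, converted to milliseconds.
timings_ms = {
    "pandas merge + filter": 9850.0,
    "dev": 2030.0,
    "this PR": 29.9,
    "dev, force=True": 255.0,
    "this PR, force=True": 83.7,
}


def speedup(before, after):
    """Ratio of the mean runtime of `before` to that of `after`."""
    return timings_ms[before] / timings_ms[after]


print(f"{speedup('dev', 'this PR'):.1f}x")                          # prints 67.9x
print(f"{speedup('dev, force=True', 'this PR, force=True'):.1f}x")  # prints 3.0x
print(f"{speedup('pandas merge + filter', 'this PR'):.1f}x")        # prints 329.4x
```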

PR Checklist

Please ensure that you have done the following:

  1. Open the PR from a feature branch on your fork: do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.md.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests pass
  • Making sure that code coverage doesn't go down

Relevant Reviewers

Please tag maintainers to review.

@samukweku samukweku self-assigned this Aug 4, 2025
@samukweku samukweku linked an issue Aug 4, 2025 that may be closed by this pull request
@codecov

codecov bot commented Sep 23, 2025

Codecov Report

❌ Patch coverage is 51.29418% with 1750 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.49%. Comparing base (e1b64c1) to head (2016237).
⚠️ Report is 40 commits behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1494      +/-   ##
==========================================
- Coverage   83.49%   75.49%   -8.00%     
==========================================
  Files          88      100      +12     
  Lines        6469     7895    +1426     
==========================================
+ Hits         5401     5960     +559     
- Misses       1068     1935     +867     

@samukweku samukweku marked this pull request as draft September 24, 2025 08:30
@samukweku samukweku marked this pull request as ready for review September 24, 2025 13:06
@samukweku
Collaborator Author

@ericmjl given the size of this PR, I will close it and split it into smaller PRs. What are your thoughts on that, and on including Cython? If it were implemented in Rust instead, would that be ok (it could also be useful for Polars extensions for some of our Polars functions)?

@samukweku samukweku closed this Oct 15, 2025

Development

Successfully merging this pull request may close these issues.

set up cython

2 participants