-
Notifications
You must be signed in to change notification settings - Fork 11
More receiving yards parsing cases #120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Removed deprecated labels for elapsed time columns in cfbd_drives and added new columns to cfbd_live_plays documentation and tests. Updated test expectations to use expect_in instead of expect_setequal for column checks.
Added new columns to cfbd_play_stats_player output, improved sack player aggregation, handled NULL values, and updated documentation and examples to reflect changes. Also updated cfbd_live_plays documentation to include new columns for average start yard line and deserve to win metrics.
Changed cfbd_pbp_data to assign 3 timeouts per half for offense and defense when timeout data is missing from the API. Updated documentation and examples to reflect this behavior.
Added .groups = "drop" to the dplyr::summarise call in add_play_counts to control grouping behavior and prevent potential warnings in future dplyr versions.
Removed the specific count of variables from the return value description in both R and Rd files to improve maintainability and accuracy as the data frame structure may change.
… function usage in play data functions Corrected spacing and replaced superseded `dplyr::distinct_all()` with `dplyr::distinct()`, and standardized assignment spacing for improved code readability and consistency.
Added a check to filter out games with fewer than 20 plays in the play-by-play data processing. This helps avoid issues with EPA/WPA models and improves data validation.
Update package version to 2.1.0. Add release notes for bug fixes in `cfbd_pbp_data()` and improvements to `add_yardage()` handling missing yardage values. Update cran-comments to reflect minor release and changes.
Added normalization for 'seasonType' to 'season_type' in cfbd_stats_game_advanced. Updated tests to check for column inclusion with expect_in instead of expect_setequal, and added team ID columns to betting lines test.
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
📝 WalkthroughWalkthroughThis PR adds play-text preprocessing via a new exported clean_play_text(play_df), updates the yardage-extraction pipeline to use cleaned_text, expands regex patterns for rushing/passing/receiving yard parsing, updates NAMESPACE/DESCRIPTION/docs/tests, and integrates the cleaning step into cfbd_pbp_data's per-game processing. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks✅ Passed checks (3 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
R/helper_pbp_add_yardage.R (1)
44-56: Document the dependency oncleaned_textcolumn.The function now relies on a
cleaned_textcolumn (used starting at line 58) but doesn't validate its presence or document this requirement. Consider either:
- Adding parameter documentation that
play_dfmust contain acleaned_textcolumn- Adding a validation check at the start of the function
- Creating the column if it doesn't exist by calling
clean_play_text()🔎 Example validation check
add_yardage <- function(play_df) { + if (!"cleaned_text" %in% names(play_df)) { + stop("play_df must contain a 'cleaned_text' column. Call clean_play_text() first.") + } + play_df$yds_rushed <- NA_real_ play_df$yds_receiving <- NA_real_
🧹 Nitpick comments (1)
R/cfbd_pbp_data.R (1)
2168-2179: Consider refactoring for efficiency and completeness.The function could be improved in several ways:
- Efficiency: Multiple
str_replacecalls that reassign to the same column are less efficient than chaining operations or usingstr_replace_allwith multiple patterns.- Whitespace handling: Consider adding
str_trim()at the end to remove leading/trailing whitespace that may result from the replacements.- Pattern consolidation: Some patterns could potentially be combined (e.g., the two "No Huddle" patterns on lines 2174-2175).
💡 Example refactored implementation
clean_play_text <- function(play_df) { play_df <- play_df %>% dplyr::mutate( cleaned_text = .data$play_text %>% stringr::str_replace("^\\(\\d{1,2}:\\d{2}\\)\\s+", "") %>% stringr::str_replace("\\s\\b(short|deep)\\b\\s", " ") %>% stringr::str_replace("\\s\\b(left|middle|right)\\b\\s", " ") %>% stringr::str_replace("\\s*No Huddle-Shotgun\\s+", "") %>% stringr::str_replace("No Huddle-?", "") %>% stringr::str_replace("\\s*Shotgun\\s+", "") %>% stringr::str_replace("\\s+", " ") %>% stringr::str_trim() ) }
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
DESCRIPTIONNAMESPACER/cfbd_pbp_data.RR/helper_pbp_add_yardage.RcfbfastR.Rprojman/helpers_pbp.Rdtests/testthat/test-cfbd_pbp_data.R
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.
Applied to files:
R/helper_pbp_add_yardage.RR/cfbd_pbp_data.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: ubuntu-latest (release)
- GitHub Check: windows-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
🔇 Additional comments (6)
DESCRIPTION (1)
61-61: LGTM! Appropriate dependency addition.Adding
patrickfor parameterized testing aligns with the test updates mentioned in the PR summary.NAMESPACE (1)
70-70: LGTM! Appropriate NAMESPACE updates.The new export and import declarations correctly support the
clean_play_text()function.Also applies to: 152-152
man/helpers_pbp.Rd (1)
11-11: LGTM! Complete documentation for the new function.The documentation properly describes
clean_play_text()including its purpose, usage, and return values. The note about ESPN PBP changes in 2025 provides helpful context.Also applies to: 26-26, 181-184, 342-345
R/cfbd_pbp_data.R (1)
598-598: LGTM! Appropriate pipeline placement.Calling
clean_play_text()beforepenalty_detection()ensures the text is cleaned early in the processing pipeline.R/helper_pbp_add_yardage.R (1)
157-238: Inconsistent use ofcleaned_textvsplay_textacross yardage extractions should be clarified.The
yds_rushedandyds_receivingcalculations usecleaned_text(lines 58–150), whileyds_int_return,yds_kickoff,yds_punted,yds_fumble_return, andyds_sackedstill useplay_text(lines 157–238).The
cleaned_textfunction removes pass depth/direction qualifiers (short/deep/left/middle/right), clock timestamps, and huddle/shotgun markers. These removals likely benefit rush and receiving patterns but may not affect special teams or interception patterns, which use different regex anchors. However, this design decision lacks documentation. Either:
- Add a comment explaining why only rush/receiving need
cleaned_textand others don't- Apply
cleaned_textconsistently across all yardage extractions for uniformity- Document in the function's roxygen comments which yardage types require text cleaning
tests/testthat/test-cfbd_pbp_data.R (1)
57-59: No action needed. The test filtering mechanism will work correctly.The
cfbd_pbp_data()function preserves the originalplay_textcolumn. Theclean_play_text()function creates a separatecleaned_textcolumn viadplyr::mutate()without modifyingplay_text. The exact string matching on line 57 using the full timestamp strings like"(14:46) Shotgun #10 H.King..."will find the intended plays as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
R/helper_pbp_add_yardage.Rtests/testthat/test-cfbd_pbp_data.Rtests/testthat/test-cfbd_play_stats_types.Rtests/testthat/test-cfbd_play_types.R
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/testthat/test-cfbd_pbp_data.R
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.
Applied to files:
R/helper_pbp_add_yardage.R
📚 Learning: 2025-08-25T01:47:09.915Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_play.R:731-733
Timestamp: 2025-08-25T01:47:09.915Z
Learning: In the cfbfastR codebase, the pattern `tibble::tibble(data = .data$.)` in R/cfbd_play.R is valid and should not be flagged as incorrect usage of the .data pronoun.
Applied to files:
tests/testthat/test-cfbd_play_stats_types.Rtests/testthat/test-cfbd_play_types.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: windows-latest (release)
- GitHub Check: ubuntu-latest (oldrel-1)
- GitHub Check: ubuntu-latest (release)
🔇 Additional comments (5)
tests/testthat/test-cfbd_play_stats_types.R (1)
5-12: LGTM - Test expectation updated for new play stat type.The row count expectation reflects an additional play stat type from the API. Note that this test is coupled to external data and may require future updates if the API schema changes.
R/helper_pbp_add_yardage.R (3)
55-58: Good defensive initialization of cleaned_text.This fallback ensures the function works correctly whether or not the caller has pre-processed the data with
clean_play_text().
62-113: Previous regex alternation issue resolved by restructuring.The lookbehind patterns have been correctly split into separate branches for "run" and "rush" variants, eliminating the parsing ambiguity flagged in the previous review. Each pattern is now unambiguous (e.g.,
"(?<= run for a loss of)"vs"(?<= rush for a loss of)").
158-167: Verify regex pattern combination for PASSER pass edge case.Lines 163-167 check for both
"pass$"(end of string) AND"^to "(start of string) in the samecleaned_text. This seems contradictory unlessclean_play_text()transforms the text in a way that makes both patterns matchable.Could you verify this logic works as intended with sample 2025 play data? If
cleaned_textcannot simultaneously end with "pass" and start with "to ", this branch will never match.tests/testthat/test-cfbd_play_types.R (1)
5-12: LGTM - Test expectation updated for new play type.The row count expectation reflects an additional play type from the API. Same consideration as the play stats types test regarding external data coupling.
Summary by CodeRabbit
New Features
Tests
✏️ Tip: You can customize this high-level summary in your review settings.