Skip to content

Conversation

@akeaswaran
Copy link
Collaborator

@akeaswaran akeaswaran commented Jan 4, 2026

Summary by CodeRabbit

  • New Features

    • Play descriptions are now cleaned and normalized to improve parsing accuracy.
    • Yardage parsing has been enhanced to recognize more rushing and passing text patterns.
  • Tests

    • Added extended tests covering 2025 play-by-play yardage cases and updated expectations for play-type/lookups.

✏️ Tip: You can customize this high-level summary in your review settings.

akeaswaran and others added 18 commits September 3, 2025 14:01
Removed deprecated labels for elapsed time columns in cfbd_drives and added new columns to cfbd_live_plays documentation and tests. Updated test expectations to use expect_in instead of expect_setequal for column checks.
Added new columns to cfbd_play_stats_player output, improved sack player aggregation, handled NULL values, and updated documentation and examples to reflect changes. Also updated cfbd_live_plays documentation to include new columns for average start yard line and deserve to win metrics.
Changed cfbd_pbp_data to assign 3 timeouts per half for offense and defense when timeout data is missing from the API. Updated documentation and examples to reflect this behavior.
Added .groups = "drop" to the dplyr::summarise call in add_play_counts to control grouping behavior and prevent potential warnings in future dplyr versions.
Removed the specific count of variables from the return value description in both R and Rd files to improve maintainability and accuracy as the data frame structure may change.
… function usage in play data functions

Corrected spacing and replaced superseded `dplyr::distinct_all()` with `dplyr::distinct()`, and standardized assignment spacing for improved code readability and consistency.
Added a check to filter out games with fewer than 20 plays in the play-by-play data processing. This helps avoid issues with EPA/WPA models and improves data validation.
Update package version to 2.1.0. Add release notes for bug fixes in `cfbd_pbp_data()` and improvements to `add_yardage()` handling missing yardage values. Update cran-comments to reflect minor release and changes.
Added normalization for 'seasonType' to 'season_type' in cfbd_stats_game_advanced. Updated tests to check for column inclusion with expect_in instead of expect_setequal, and added team ID columns to betting lines test.
@vercel
Copy link

vercel bot commented Jan 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Review Updated (UTC)
cfbfastr Ready Ready Preview, Comment Jan 4, 2026 6:15pm

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 4, 2026

📝 Walkthrough

Walkthrough

This PR adds play-text preprocessing via a new exported clean_play_text(play_df), updates the yardage-extraction pipeline to use cleaned_text, expands regex patterns for rushing/passing/receiving yard parsing, updates NAMESPACE/DESCRIPTION/docs/tests, and integrates the cleaning step into cfbd_pbp_data's per-game processing.

Changes

Cohort / File(s) Summary
Dependency Management
DESCRIPTION
Added patrick to Suggests for parameterized testing.
Text Preprocessing & Integration
R/cfbd_pbp_data.R, man/helpers_pbp.Rd
Added exported clean_play_text(play_df) and wired it into the PBP processing pipeline before penalty detection; documented function in man/helpers_pbp.Rd.
Yardage Extraction Logic
R/helper_pbp_add_yardage.R
add_yardage() now ensures/uses cleaned_text and replaces play_text references; expanded and realigned regex branches and fallbacks to handle additional 2025 play-text variants for rush, pass, and receiving yard extraction.
Namespace & Imports
NAMESPACE
Exported clean_play_text and added importFrom(stringr, str_replace).
Tests
tests/testthat/test-cfbd_pbp_data.R, tests/testthat/test-cfbd_play_stats_types.R, tests/testthat/test-cfbd_play_types.R
Added parameterized 2025 yardage tests (uses patrick); adjusted expectations for cfbd_play_stats_types() (25→26 rows) and cfbd_play_types() (48→49 rows).
Project File
cfbfastR.Rproj
Formatting/spacing adjustments only (no functional change).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • saiemgilani

Poem

🐰 In tangled text I hopped and peeped,
I nudged out time and places deep.
With whiskers twitched and regex bright,
The yards now leap into the light.
Hooray—cleaned strings and tidy play!

Pre-merge checks

✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title 'More receiving yards parsing cases' directly reflects the main change: expanded regex/extraction logic in the yardage parsing to handle additional textual patterns for receiving yards and broader play text scenarios.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
R/helper_pbp_add_yardage.R (1)

44-56: Document the dependency on cleaned_text column.

The function now relies on a cleaned_text column (used starting at line 58) but doesn't validate its presence or document this requirement. Consider either:

  1. Adding parameter documentation that play_df must contain a cleaned_text column
  2. Adding a validation check at the start of the function
  3. Creating the column if it doesn't exist by calling clean_play_text()
🔎 Example validation check
 add_yardage <- function(play_df) {
+  if (!"cleaned_text" %in% names(play_df)) {
+    stop("play_df must contain a 'cleaned_text' column. Call clean_play_text() first.")
+  }
+  
   play_df$yds_rushed <- NA_real_
   play_df$yds_receiving <- NA_real_
🧹 Nitpick comments (1)
R/cfbd_pbp_data.R (1)

2168-2179: Consider refactoring for efficiency and completeness.

The function could be improved in several ways:

  1. Efficiency: Multiple str_replace calls that reassign to the same column are less efficient than chaining operations or using str_replace_all with multiple patterns.
  2. Whitespace handling: Consider adding str_trim() at the end to remove leading/trailing whitespace that may result from the replacements.
  3. Pattern consolidation: Some patterns could potentially be combined (e.g., the two "No Huddle" patterns on lines 2174-2175).
💡 Example refactored implementation
clean_play_text <- function(play_df) {
  play_df <- play_df %>%
    dplyr::mutate(
      cleaned_text = .data$play_text %>%
        stringr::str_replace("^\\(\\d{1,2}:\\d{2}\\)\\s+", "") %>%
        stringr::str_replace("\\s\\b(short|deep)\\b\\s", " ") %>%
        stringr::str_replace("\\s\\b(left|middle|right)\\b\\s", " ") %>%
        stringr::str_replace("\\s*No Huddle-Shotgun\\s+", "") %>%
        stringr::str_replace("No Huddle-?", "") %>%
        stringr::str_replace("\\s*Shotgun\\s+", "") %>%
        stringr::str_replace("\\s+", " ") %>%
        stringr::str_trim()
    )
}
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b80c73a and a4ff4e1.

📒 Files selected for processing (7)
  • DESCRIPTION
  • NAMESPACE
  • R/cfbd_pbp_data.R
  • R/helper_pbp_add_yardage.R
  • cfbfastR.Rproj
  • man/helpers_pbp.Rd
  • tests/testthat/test-cfbd_pbp_data.R
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.

Applied to files:

  • R/helper_pbp_add_yardage.R
  • R/cfbd_pbp_data.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: ubuntu-latest (release)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
🔇 Additional comments (6)
DESCRIPTION (1)

61-61: LGTM! Appropriate dependency addition.

Adding patrick for parameterized testing aligns with the test updates mentioned in the PR summary.

NAMESPACE (1)

70-70: LGTM! Appropriate NAMESPACE updates.

The new export and import declarations correctly support the clean_play_text() function.

Also applies to: 152-152

man/helpers_pbp.Rd (1)

11-11: LGTM! Complete documentation for the new function.

The documentation properly describes clean_play_text() including its purpose, usage, and return values. The note about ESPN PBP changes in 2025 provides helpful context.

Also applies to: 26-26, 181-184, 342-345

R/cfbd_pbp_data.R (1)

598-598: LGTM! Appropriate pipeline placement.

Calling clean_play_text() before penalty_detection() ensures the text is cleaned early in the processing pipeline.

R/helper_pbp_add_yardage.R (1)

157-238: Inconsistent use of cleaned_text vs play_text across yardage extractions should be clarified.

The yds_rushed and yds_receiving calculations use cleaned_text (lines 58–150), while yds_int_return, yds_kickoff, yds_punted, yds_fumble_return, and yds_sacked still use play_text (lines 157–238).

The cleaned_text function removes pass depth/direction qualifiers (short/deep/left/middle/right), clock timestamps, and huddle/shotgun markers. These removals likely benefit rush and receiving patterns but may not affect special teams or interception patterns, which use different regex anchors. However, this design decision lacks documentation. Either:

  • Add a comment explaining why only rush/receiving need cleaned_text and others don't
  • Apply cleaned_text consistently across all yardage extractions for uniformity
  • Document in the function's roxygen comments which yardage types require text cleaning
tests/testthat/test-cfbd_pbp_data.R (1)

57-59: No action needed. The test filtering mechanism will work correctly.

The cfbd_pbp_data() function preserves the original play_text column. The clean_play_text() function creates a separate cleaned_text column via dplyr::mutate() without modifying play_text. The exact string matching on line 57 using the full timestamp strings like "(14:46) Shotgun #10 H.King..." will find the intended plays as expected.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4ff4e1 and f9704f9.

📒 Files selected for processing (4)
  • R/helper_pbp_add_yardage.R
  • tests/testthat/test-cfbd_pbp_data.R
  • tests/testthat/test-cfbd_play_stats_types.R
  • tests/testthat/test-cfbd_play_types.R
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/testthat/test-cfbd_pbp_data.R
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-25T01:24:22.513Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_stats.R:381-381
Timestamp: 2025-08-25T01:24:22.513Z
Learning: In cfbfastR's cfbd_stats_season_advanced() function, the gsub("Opportunies", "_opportunities", colnames(df)) is intentional and correct for handling the actual API response column naming from CollegeFootballData.com, not a typo as it might appear.

Applied to files:

  • R/helper_pbp_add_yardage.R
📚 Learning: 2025-08-25T01:47:09.915Z
Learnt from: saiemgilani
Repo: sportsdataverse/cfbfastR PR: 113
File: R/cfbd_play.R:731-733
Timestamp: 2025-08-25T01:47:09.915Z
Learning: In the cfbfastR codebase, the pattern `tibble::tibble(data = .data$.)` in R/cfbd_play.R is valid and should not be flagged as incorrect usage of the .data pronoun.

Applied to files:

  • tests/testthat/test-cfbd_play_stats_types.R
  • tests/testthat/test-cfbd_play_types.R
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: windows-latest (release)
  • GitHub Check: ubuntu-latest (oldrel-1)
  • GitHub Check: ubuntu-latest (release)
🔇 Additional comments (5)
tests/testthat/test-cfbd_play_stats_types.R (1)

5-12: LGTM - Test expectation updated for new play stat type.

The row count expectation reflects an additional play stat type from the API. Note that this test is coupled to external data and may require future updates if the API schema changes.

R/helper_pbp_add_yardage.R (3)

55-58: Good defensive initialization of cleaned_text.

This fallback ensures the function works correctly whether or not the caller has pre-processed the data with clean_play_text().


62-113: Previous regex alternation issue resolved by restructuring.

The lookbehind patterns have been correctly split into separate branches for "run" and "rush" variants, eliminating the parsing ambiguity flagged in the previous review. Each pattern is now unambiguous (e.g., "(?<= run for a loss of)" vs "(?<= rush for a loss of)").


158-167: Verify regex pattern combination for PASSER pass edge case.

Lines 163-167 check for both "pass$" (end of string) AND "^to " (start of string) in the same cleaned_text. This seems contradictory unless clean_play_text() transforms the text in a way that makes both patterns matchable.

Could you verify this logic works as intended with sample 2025 play data? If cleaned_text cannot simultaneously end with "pass" and start with "to ", this branch will never match.

tests/testthat/test-cfbd_play_types.R (1)

5-12: LGTM - Test expectation updated for new play type.

The row count expectation reflects an additional play type from the API. Same consideration as the play stats types test regarding external data coupling.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants