Skip to content

feat(string): add find_by_charcode #2084

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Conversation

Yu-zh
Copy link
Collaborator

@Yu-zh Yu-zh commented May 12, 2025

Motivation: find_by_charcode will be more efficient than find_by because it skips the surrogate pair checks. This API is used for performance-critical applications.

Additionally, I'm not sure whether we want to expose an API that returns Iter[Int] or similar so that users care about performance can implement customized high performance string operations.

Copy link

peter-jerry-ye-code-review bot commented May 12, 2025

Direct duplication of find_by_charcode implementation between String and View

Category
Performance
Code Snippet
pub fn String::find_by_charcode and pub fn View::find_by_charcode have identical implementations
Recommendation
Consider implementing String::find_by_charcode in terms of View::find_by_charcode like:

pub fn String::find_by_charcode(self : String, pred : (Int) -> Bool) -> Int? {
  self[:].find_by_charcode(pred)
}

Reasoning
Duplicated code increases maintenance burden. Since String can be viewed as a View, we can reuse the View implementation to maintain consistency and reduce code duplication, similar to how the original find_by was implemented

Missing documentation about performance implications and usage guidance

Category
Maintainability
Code Snippet
/// Find the index of the first charcode that satisfies the given predicate.
Recommendation
Expand documentation to explain performance benefits and limitations:

/// Find the index of the first charcode that satisfies the given predicate.
/// This is a low-level API that operates on raw character codes for performance.
/// Note: Use with caution as it bypasses Unicode character boundaries.
/// For most cases, prefer find_by unless performance is critical.

Reasoning
The current documentation doesn't explain why this API exists alongside find_by or when to use one over the other. Adding clear guidance helps developers make informed decisions

Test cases could be more comprehensive for edge cases

Category
Correctness
Code Snippet
test "find_by_charcode" {
Recommendation
Add test cases for:

  1. Invalid UTF-8 sequences
  2. Strings with mixed surrogate pairs
  3. Performance comparison with find_by
  4. Boundary conditions with very large strings
    Reasoning
    While the current test suite is good, adding these cases would help verify the correctness and performance benefits in various scenarios that might occur in production use

@coveralls
Copy link
Collaborator

coveralls commented May 12, 2025

Pull Request Test Coverage Report for Build 6727

Details

  • 6 of 6 (100.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 92.463%

Files with Coverage Reduction New Missed Lines %
string/view.mbt 1 84.51%
Totals Coverage Status
Change from base Build 6725: 0.02%
Covered Lines: 6146
Relevant Lines: 6647

💛 - Coveralls

@Yu-zh Yu-zh requested a review from bobzhang May 16, 2025 06:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants