Skip to content

ICU-22789 Add Segmenter API to conveniently wrap BreakIterator in ICU4J #3237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 73 commits into
base: main
Choose a base branch
from

Conversation

echeran
Copy link
Contributor

@echeran echeran commented Oct 8, 2024

In order to "modernize" the BreakIterator API, this PR introduces a new wrapper using a more convenient, modern API design around a Segmenter interface.

A few of the goals that motivate the new Segmenter API:

  • Use newer Java features from Java 8 that support the Stream API which underlies a functional programming style
  • Create instances that are immutable (reduces complexity borne of statefulness; allows user code to be more referentially transparent)
  • Create a wrapper class around the iteration. This allows the decoupling of the iteration of a source string from the construction of the BreakIterator such that we can perform iteration over one string in isolation from other strings
  • Use interfaces to properly decouple and abstract. APIs built on top of interfaces can allow user-created implementations to participate in such higher level APIs.

More details in the design doc.

This PR will focus on the ICU4J side of the work.

Checklist

  • Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22789
  • Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • Issue accepted (done by Technical Committee after discussion)
  • Tests included, if applicable
  • API docs and/or User Guide docs changed or added, if applicable

echeran added 30 commits October 8, 2024 16:26
@echeran echeran changed the title ICU-22789 Add Segmenter API to conveniently wrap BreakIterator ICU-22789 Add Segmenter API to conveniently wrap BreakIterator in ICU4J Apr 11, 2025
@echeran echeran marked this pull request as ready for review April 11, 2025 22:55
richgillam
richgillam previously approved these changes Apr 17, 2025
Copy link
Contributor

@richgillam richgillam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! One quibble.

} else {
assert direction == IterationDirection.BACKWARDS;
this.currIdx = breakIter.preceding(startIdx);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If startIdx is a boundary, there's no way to tell from this API. Is that intentional? Is that handled somewhere else? Do we always assume startIdx is a boundary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good observation. Yes, there is a difference between the semantics of the Segments.backFrom(...) public API that includes a boundary and the behavior of BreakIterator.preceding() that always moved backwards at least 1.

I had the logic for adjusting startIdx in the backwards direction condition to account for that difference is in the helper utility fn SegmentsImplUtils.boundariesBackFrom().

This comment reminds me that it would make sense to have this logic closer to where it is actually acted upon, which is here in BoundaryIteratorOfInts. So I just now went ahead and refactored that logic out from the helper util class and into the constructor of BoundaryIteratorOfInts. Even though BoundaryIteratorOfInts is an internal helper class, the refactoring makes it behave more like the public APIs we're supporting without exposing BreakIterator impl details to private internal callers, which is cleaner, even if transparent to users.

I didn't do so already because I would need to additionally pass a reference to the input CharSequence to the constructor of BoundaryIteratorOfInts, which is technically redundant because it already needs a BreakIterator as an argument that theoretically contains that CharSequence. However, I can't get a CharSequence back from a BreakIterator; I only get back a CharacterIterator. Even though it's redundant, I figure the cost of adding a reference to CharSequence is minimal and acceptable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the explanation. That all makes sense and sounds good.

*/
@Override
@Deprecated
public BreakIterator getNewBreakIterator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this function need to be declared public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This question is a question of why the interface Segmenter needs to include a method getNewBreakIterator(), which then implies that the method is public.

My reason for including this is based on the need for isolation (separation of state) across all concrete instances of the Segments interface that are spawned each time that a concrete Segmenter object is fed a new input CharSequence. In order to ensure isolation in the Segments implementations, they need to create a new BreakIterator instance.

Given the way things are implemented, you could argue that the interface method getNewBreakIterator() could be removed without much hassle. I wanted to include it so that the interface communicates the common denominator requirements to fulfill the contract.

I wished that there could be a way to say "make this private (non-user facing) but still make it a part of the contract that has to be implemented by concrete implementing classes". The Java 9 update to interfaces to allow private methods is not fully what I want -- it only allows those private methods to be used as helpers when implementing default implementation methods, but it does not get included in the contract for implementing classes.

One the one hand, I like the principle of the current design in defining a contract for code that implements Segmenter & Segments, but on the other hand, I don't like that we have to mention BreakIterator again, even if it's in a deprecate public method.

I could be convinced to remove the (public) getBreakIterator interface method. One possible argument is that Markus mentioned the possibility of changing the internals of the concrete classes to be more efficient by rewriting them to not depend on & wrap BreakIterator.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been a fairly long time since I last through about this stuff, so I'm not sure I'm completely up on what we were talking about here, and I apologize for that. I think I had two thoughts when I wrote that comment (although who knows now?): One was that this is an implementation detail and it's unfortunate to have it in the public API, even declared @Deprecated. But, of course, you can't always avoid that in Java, and I think you've made a good case for it. The other was that it kind of assumes that all concrete subclasses of LocalizedSegmenter will use a BreakIterator for their implementation. If we want to move away from having BreakIterator inside these things at some point, or if we want clients to be able to create subclasses that do their own thing, this reference stands in the way. I'm guessing in cases like that, you could just have it return null, and that'd be fine, but it'd still be kind of a wart. On the other hand, if we don't think there's anything to be gained from moving away from BreakIterator, then it's fine the way it is and you can ignore me. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rereading my comments below, it looks like I was suggesting that you move the getNewBreakIterator() method into the classes where it's getting used and where it could be made private so that you don't have to expose it to the public API. And I think I'm reading your response to say that you can't do that-- the information necessary to create a new break iterator lives in Segmenter and there isn't a good way to pass it all along. I'll buy that-- this might just be a case where we have to live with the deprecated public method.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think the question you raised about the possibility of future work that would reimplement the internals to bypass BreakIterator is a good question. It makes me think that your original point of excluding the public deprecated (internal) method from the interface is a good one if there is even just a slight chance of such future work. So I'm leaning in favor of removing it, and saying that doing so is the proper decision.

Perhaps the counterargument is, you could say, that it's easier to have a deprecated interface method that you can ignore in the future (and have future implementations just implement it by returning null, like you said), rather than adding a method to an interface later on. You can't really add a method to an interface or abstract class without breaking the contract for existing implementations, as this comment in BreakIterator.preceding() points out, ironically enough.

So I think the question is back to what you said, is can Segmenter stand on its own independently, or is it defined intrinsically as a wrapper of BreakIterator? I think the answer is the former, which is why I'm leaning in favor of removing it, like you said.

Maybe @markusicu could weigh in?

private LocalizedSegments(CharSequence source, LocalizedSegmenter segmenter) {
this.source = source;
this.segmenter = segmenter;
this.breakIter = this.segmenter.getNewBreakIterator();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing LocalizedSegmented.getNewBreakIterator() is public because you're calling it here. Would it make more sense to either have it be a method on LocalizedSegments, or for the segmenter to call that (now-private) function and pass the result to this constructor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See conversation in thread above.

*/
@Override
@Deprecated
public BreakIterator getNewBreakIterator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thoughts as the version of this method on LocalizedSegmenter. I think you could avoid having this be public by either having the segmenter pass the break iterator to the Segments object on construction or by having this actually be a method on the Segments object.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See conversation in thread above.

@richgillam
Copy link
Contributor

Looks great! One quibble.

Actually, one other observation. As things stand, Segmenter and its subclasses don't do much-- they just create Segments objects, which wrap BreakIterator objects that do all the work. The API is a lot cleaner and clearer, but the implementation isn't. I assume the plan in the future at some point is to move to an implementation where the Segmenter actually owns the state and category tables and the Segments object just handles iteration over a particular string? (I'm not saying you need to do this now; just clarifying that that's in the plan.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants