-
-
Notifications
You must be signed in to change notification settings - Fork 801
ICU-22789 Add Segmenter API to conveniently wrap BreakIterator in ICU4J #3237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…e static util fns for concrete classes
…ent field of source CS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! One quibble.
} else { | ||
assert direction == IterationDirection.BACKWARDS; | ||
this.currIdx = breakIter.preceding(startIdx); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If startIdx
is a boundary, there's no way to tell from this API. Is that intentional? Is that handled somewhere else? Do we always assume startIdx
is a boundary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good observation. Yes, there is a difference between the semantics of the Segments.backFrom(...)
public API that includes a boundary and the behavior of BreakIterator.preceding()
that always moved backwards at least 1.
I had the logic for adjusting startIdx
in the backwards direction condition to account for that difference is in the helper utility fn SegmentsImplUtils.boundariesBackFrom()
.
This comment reminds me that it would make sense to have this logic closer to where it is actually acted upon, which is here in BoundaryIteratorOfInts
. So I just now went ahead and refactored that logic out from the helper util class and into the constructor of BoundaryIteratorOfInts
. Even though BoundaryIteratorOfInts
is an internal helper class, the refactoring makes it behave more like the public APIs we're supporting without exposing BreakIterator impl details to private internal callers, which is cleaner, even if transparent to users.
I didn't do so already because I would need to additionally pass a reference to the input CharSequence
to the constructor of BoundaryIteratorOfInts
, which is technically redundant because it already needs a BreakIterator
as an argument that theoretically contains that CharSequence
. However, I can't get a CharSequence
back from a BreakIterator
; I only get back a CharacterIterator
. Even though it's redundant, I figure the cost of adding a reference to CharSequence
is minimal and acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the explanation. That all makes sense and sounds good.
*/ | ||
@Override | ||
@Deprecated | ||
public BreakIterator getNewBreakIterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this function need to be declared public
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This question is a question of why the interface Segmenter
needs to include a method getNewBreakIterator()
, which then implies that the method is public.
My reason for including this is based on the need for isolation (separation of state) across all concrete instances of the Segments
interface that are spawned each time that a concrete Segmenter
object is fed a new input CharSequence
. In order to ensure isolation in the Segments
implementations, they need to create a new BreakIterator
instance.
Given the way things are implemented, you could argue that the interface method getNewBreakIterator()
could be removed without much hassle. I wanted to include it so that the interface communicates the common denominator requirements to fulfill the contract.
I wished that there could be a way to say "make this private (non-user facing) but still make it a part of the contract that has to be implemented by concrete implementing classes". The Java 9 update to interfaces to allow private methods is not fully what I want -- it only allows those private methods to be used as helpers when implementing default implementation methods, but it does not get included in the contract for implementing classes.
One the one hand, I like the principle of the current design in defining a contract for code that implements Segmenter
& Segments
, but on the other hand, I don't like that we have to mention BreakIterator
again, even if it's in a deprecate public method.
I could be convinced to remove the (public) getBreakIterator
interface method. One possible argument is that Markus mentioned the possibility of changing the internals of the concrete classes to be more efficient by rewriting them to not depend on & wrap BreakIterator
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a fairly long time since I last through about this stuff, so I'm not sure I'm completely up on what we were talking about here, and I apologize for that. I think I had two thoughts when I wrote that comment (although who knows now?): One was that this is an implementation detail and it's unfortunate to have it in the public API, even declared @Deprecated
. But, of course, you can't always avoid that in Java, and I think you've made a good case for it. The other was that it kind of assumes that all concrete subclasses of LocalizedSegmenter
will use a BreakIterator
for their implementation. If we want to move away from having BreakIterator
inside these things at some point, or if we want clients to be able to create subclasses that do their own thing, this reference stands in the way. I'm guessing in cases like that, you could just have it return null, and that'd be fine, but it'd still be kind of a wart. On the other hand, if we don't think there's anything to be gained from moving away from BreakIterator
, then it's fine the way it is and you can ignore me. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rereading my comments below, it looks like I was suggesting that you move the getNewBreakIterator()
method into the classes where it's getting used and where it could be made private so that you don't have to expose it to the public API. And I think I'm reading your response to say that you can't do that-- the information necessary to create a new break iterator lives in Segmenter
and there isn't a good way to pass it all along. I'll buy that-- this might just be a case where we have to live with the deprecated public method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think the question you raised about the possibility of future work that would reimplement the internals to bypass BreakIterator is a good question. It makes me think that your original point of excluding the public deprecated (internal) method from the interface is a good one if there is even just a slight chance of such future work. So I'm leaning in favor of removing it, and saying that doing so is the proper decision.
Perhaps the counterargument is, you could say, that it's easier to have a deprecated interface method that you can ignore in the future (and have future implementations just implement it by returning null, like you said), rather than adding a method to an interface later on. You can't really add a method to an interface or abstract class without breaking the contract for existing implementations, as this comment in BreakIterator.preceding()
points out, ironically enough.
So I think the question is back to what you said, is can Segmenter stand on its own independently, or is it defined intrinsically as a wrapper of BreakIterator? I think the answer is the former, which is why I'm leaning in favor of removing it, like you said.
Maybe @markusicu could weigh in?
private LocalizedSegments(CharSequence source, LocalizedSegmenter segmenter) { | ||
this.source = source; | ||
this.segmenter = segmenter; | ||
this.breakIter = this.segmenter.getNewBreakIterator(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing LocalizedSegmented.getNewBreakIterator()
is public because you're calling it here. Would it make more sense to either have it be a method on LocalizedSegments
, or for the segmenter to call that (now-private) function and pass the result to this constructor?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See conversation in thread above.
*/ | ||
@Override | ||
@Deprecated | ||
public BreakIterator getNewBreakIterator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same thoughts as the version of this method on LocalizedSegmenter
. I think you could avoid having this be public by either having the segmenter pass the break iterator to the Segments
object on construction or by having this actually be a method on the Segments
object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See conversation in thread above.
Actually, one other observation. As things stand, |
In order to "modernize" the
BreakIterator
API, this PR introduces a new wrapper using a more convenient, modern API design around aSegmenter
interface.A few of the goals that motivate the new
Segmenter
API:Stream
API which underlies a functional programming styleMore details in the design doc.
This PR will focus on the ICU4J side of the work.
Checklist