Propertiness #1064

Merged
merged 16 commits into from
Apr 9, 2025

eggrobin
Member

@eggrobin eggrobin commented Mar 18, 2025

A classification of properties derived from presence in PropertyAliases, or derived from a field that we are forced to fill in in ExtraPropertyAliases (contrast PropertyStatus.java, which is out of date).

In character.jsp, split the information into (UCD properties, non-UCD properties, UCD non-properties, non-UCD non-properties), with a further split for Unihan (out of UCD properties and after UCD non-properties). See it in staging:

@macchiati
Member

I'd suggest that the top be the properties on https://www.unicode.org/reports/tr18/#RL2.7, perhaps with those groupings.

Put all Contributory and Provisional into a separate bucket.

Not sure what the parens are for, as in (kEH_Core)

Some values don't have links, eg "Obsolete"

Identifier_Status Restricted
Identifier_Type Obsolete

The
If you are going to have a bucket Non-UCD properties for U+A7FE, then add confusable, emoji, ...

Will look it over more tomorrow.

@eggrobin
Member Author

eggrobin commented Mar 19, 2025

> Not sure what the parens are for

Provisional; see the heading “Normative, Informative, Contributory, and (Provisional) UCD properties”.

> I'd suggest that the top be the properties on https://www.unicode.org/reports/tr18/#RL2.7, perhaps with those groupings.

Finer property status (splitting out Contributory etc.) and groupings would be nice, but we do not have a maintainable way of keeping track of it so far (there was an attempt with PropertyStatus.java, but as noted in the PR description, that did not work). Here I am instead doing what I can based on what we are forced to maintain, namely *PropertyAliases.txt.
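For illustration, deriving the set of UCD properties from PropertyAliases.txt could look roughly like the sketch below. It assumes the usual `short ; long ; ...` field layout with `#` comments. The class and method names (`UcdPropertyNames`, `longNames`) are hypothetical and not taken from the unicodetools code, and the extra field in ExtraPropertyAliases is not modelled.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of unicodetools.
class UcdPropertyNames {
    /** Collects the long names (second field) from PropertyAliases.txt-style lines. */
    static Set<String> longNames(List<String> lines) {
        Set<String> names = new LinkedHashSet<>();
        for (String line : lines) {
            // Strip any trailing comment, then skip lines with no data.
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) {
                continue;
            }
            // Fields are separated by semicolons; the second one is the long name.
            String[] fields = data.split(";");
            if (fields.length >= 2) {
                names.add(fields[1].trim());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> propertyAliases = List.of(
                "# Binary Properties",
                "AHex                     ; ASCII_Hex_Digit",
                "sc                       ; Script",
                "");
        Set<String> ucd = longNames(propertyAliases);
        // A property is treated as a UCD property if its long name is listed.
        System.out.println("Script: " + ucd.contains("Script"));
        System.out.println("Identifier_Status: " + ucd.contains("Identifier_Status"));
    }
}
```

With such a set, a property whose long name is absent (for example Identifier_Status, which comes from UTS39) would fall into the non-UCD bucket.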

> Some values don't have links, eg "Obsolete"

Yes, that is because it is multivalued; see #1018 item 2.

> If you are going to have a bucket Non-UCD properties for U+A7FE, then add confusable, emoji, ...

Confusable is there; it goes into Non-UCD non-properties (Other information). The Identifier_* stuff is what UTS39 actually describes as a property.

RGI_Emoji (but not RGI_Emoji_*_Sequence) should be there because it is described as a property in UTS51, but isn’t because it is hacked directly into the JSPs instead of being in IndexUnicodeProperties; I will add it later, see the TODOs in ExtraPropertyAliases.

@eggrobin
Member Author

> Here I am instead doing what I can based on what we are forced to maintain, namely *PropertyAliases.txt.

Note that beyond the cosmetics of grouping character.jsp, we actually want to keep track of the « is this a UCD property » information, see #1049.

@eggrobin
Member Author

Note: I tried splitting out Provisional from Normative+Informative, but having them in two blocks seemed counterproductive for Unihan and Unikemet (which are the only places where we have Informative properties); hence the parentheses approach.

@eggrobin
Member Author

> RGI_Emoji (but not RGI_Emoji_*_Sequence) should be there because it is described as a property in UTS51

Ah, never mind; I see UTS51 also describes the RGI_Emoji_*_Sequence zoo as properties. I’ll fix that.

@eggrobin
Member Author

As noted in the TODOs, I’d like to move RGI_Emoji and IDNA2008_Category into IndexUnicodeProperties (rather than being patched into the JSPs), and to add RGI_Emoji_Qualification, all of these being NonUcdProperty.

But I will do that in a subsequent PR.

@eggrobin
Member Author

eggrobin commented Apr 8, 2025

@markusicu Friendly ping, since I think some of @jowilco’s work is blocked on this.

markusicu
markusicu previously approved these changes Apr 8, 2025
@@ -1440,6 +1442,42 @@ public static void showProperties(

String kRSUnicode = getFactory().getProperty("kRSUnicode").getValue(cp);
boolean isUnihan = kRSUnicode != null;
List<UcdProperty> indexedProperties =
Member

optional -- might simplify something:

How about, rather than just building separate lists of properties, you add an enum PropCategory { UCD, NON_UCD, ... CJK, ...}, and create a Map<PropCategory, List<UcdProperty>>?

You could then also use maps from PropCategory to table headings and such.


Does it matter if these lists are List's? Or do you just need Collection's?
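A minimal sketch of that suggestion, to make the shape concrete. It uses plain strings in place of UcdProperty so that it is self-contained. Only the UCD, NON_UCD, and CJK constants named above are shown, and the headings and the `group` helper are assumptions, not what character.jsp does.

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; property names are Strings here rather than UcdProperty.
class PropCategorySketch {
    enum PropCategory {
        UCD("UCD properties"),
        NON_UCD("Non-UCD properties"),
        CJK("Unihan properties");

        // Table heading for this category (illustrative wording).
        final String heading;

        PropCategory(String heading) {
            this.heading = heading;
        }
    }

    /** Groups property names by category, preserving the input order within each group. */
    static Map<PropCategory, List<String>> group(Map<String, PropCategory> categoryOf) {
        Map<PropCategory, List<String>> byCategory = new EnumMap<>(PropCategory.class);
        for (Map.Entry<String, PropCategory> entry : categoryOf.entrySet()) {
            byCategory
                    .computeIfAbsent(entry.getValue(), c -> new ArrayList<>())
                    .add(entry.getKey());
        }
        return byCategory;
    }

    public static void main(String[] args) {
        Map<String, PropCategory> categoryOf = new LinkedHashMap<>();
        categoryOf.put("Script", PropCategory.UCD);
        categoryOf.put("kRSUnicode", PropCategory.CJK);
        categoryOf.put("Identifier_Status", PropCategory.NON_UCD);
        // EnumMap iterates in enum declaration order, which fixes the section order.
        for (Map.Entry<PropCategory, List<String>> e : group(categoryOf).entrySet()) {
            System.out.println(e.getKey().heading + ": " + e.getValue());
        }
    }
}
```

Keeping the heading on the enum constant means the section order and titles live in one place, and a new category only needs a new constant.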

Member Author

> Does it matter if these lists are List's?

Not really, I convert them to lists of String below anyway (because one of them is not a list of UcdProperty, namely the list of stuff that gets added in the tools).

Member Author

I will merge this now to unblock John and see if I can come up with something cleaner in a subsequent PR.

Co-authored-by: Markus Scherer <[email protected]>
@eggrobin eggrobin merged commit 86c22fd into unicode-org:main Apr 9, 2025
16 checks passed

3 participants