Skip to content

Add data and start_offset methods for StringView #2090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

MINGtoMING
Copy link
Contributor

The StringView currently lacks methods to obtain raw data and starting offset, which hinders seamless switching between StringView and String.

Here is an example:

  • Original:
///|
pub type Oid

///|
extern "C" fn oid_from_string_ffi(
  str : String,
  oid_buf : FixedArray[Oid?],
  err_buf : FixedArray[GitErrorMeta?]
) -> Unit = "oid_from_string_ffi"

///|
pub fn Oid::from_string(str : String) -> Oid!GitError {
  let oid_buf : FixedArray[Oid?] = [None]
  let err_buf : FixedArray[GitErrorMeta?] = [None]
  oid_from_string_ffi(str, oid_buf, err_buf)
  match (oid_buf, err_buf) {
    ([Some(oid)], [None]) => oid
    ([None], [Some(err)]) => raise GitError(err)
    _ => abort("unreachable")
  }
}
  • After adding:
///|
pub type Oid

///|
extern "C" fn oid_from_string_ffi(
  str : String,
  start_offset : Int,
  length : Int,
  oid_buf : FixedArray[Oid?],
  err_buf : FixedArray[GitErrorMeta?]
) -> Unit = "oid_from_string_ffi"

///|
pub fn Oid::from_string(str : @string.StringView) -> Oid!GitError {
  let oid_buf : FixedArray[Oid?] = [None]
  let err_buf : FixedArray[GitErrorMeta?] = [None]
  oid_from_string_ffi(
    str.data(),
    str.start_offset(),
    str.length(),
    oid_buf,
    err_buf,
  )
  match (oid_buf, err_buf) {
    ([Some(oid)], [None]) => oid
    ([None], [Some(err)]) => raise GitError(err)
    _ => abort("unreachable")
  }
}

Copy link

Documentation could be more specific about UTF-16 code units

Category
Correctness
Code Snippet
/// Returns the starting offset (in UTF-16 code units) of this view into its
/// underlying string.
pub fn start_offset(self : View) -> Int
Recommendation
Add a warning note about potential Unicode handling: '/// Note: The offset is in UTF-16 code units, not Unicode characters. Care should be taken when working with non-BMP Unicode characters.'
Reasoning
When working with string offsets in Unicode, it's crucial to be explicit about the encoding units. This helps prevent bugs when dealing with emoji or other characters outside the Basic Multilingual Plane.

Missing documentation for data() method's ownership semantics

Category
Maintainability
Code Snippet
///|
/// Returns the original string that is being viewed.
pub fn data(self : View) -> String
Recommendation
Add clarification about ownership: '/// Returns a reference to the original string that is being viewed. The returned String maintains the same lifetime as the StringView.'
Reasoning
Clear documentation about ownership and lifetime relationships is crucial for preventing memory-related bugs and helping users understand the API's behavior.

Test cases could be more comprehensive

Category
Maintainability
Code Snippet
test "stringview data" {
let str = "Hello🤣🤣🤣"
let view = str.view(start_offset=1, end_offset=7)
inspect!(view.data(), content="Hello🤣🤣🤣")
}
Recommendation
Add more test cases covering edge cases: empty strings, strings with only emojis, and views that span exactly one emoji character
Reasoning
The current tests only cover basic cases. Adding more edge cases would help ensure the implementation handles all scenarios correctly, especially with Unicode boundaries.

@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 6680

Details

  • 2 of 2 (100.0%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.002%) to 92.455%

Totals Coverage Status
Change from base Build 6679: 0.002%
Covered Lines: 6139
Relevant Lines: 6640

💛 - Coveralls

@Yu-zh
Copy link
Collaborator

Yu-zh commented May 14, 2025

Is there a reason that you are using String and @string.View for C FFI instead of Bytes and @bytes.View? We have provided APIs for accessing the data and start_offset in @bytes.View. See #1822 for more background information.

@MINGtoMING
Copy link
Contributor Author

MINGtoMING commented May 14, 2025

The reason for using @string.View instead of @bytes.View here is as follows:

  1. Semantic consistency: The purpose of Oid::from_string is to create an object id from a hexadecimal git id string (or prefix slice). Using @bytes.View would essentially treat the parameter as a CString, which would require additional clarification to avoid ambiguity.
  2. Ergonomics: The caller can use it directly without worrying about whether the encoding of the passed @bytes.View is correct. For example: let oid = Oid::from_string!("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
  3. Interface conflict: There is another function, Oid::from_raw, which directly copies and creates a git oid from the passed Bytes.

@peter-jerry-ye
Copy link
Collaborator

Although I don't think it's a bad idea to add these functions, I don't think your reasons hold.

You can pass the Bytes at the FFI boundary, i.e. oid_from_string_ffi, and perform the encoding / decoding on the MoonBit side. And as far as I'm aware of, the git_oid_from_string does not accept UTF16 encoded strings.

@MINGtoMING
Copy link
Contributor Author

Yes, the C prototype of oid_from_string_ffi accepts a hex string of length 40. Perhaps directly accepting a Bytes with \x00 at the end would be more convenient, but on the Moonbit interface side, it feels somewhat inelegant and ambiguous. My previous solution was to place the encoding conversion in the C code to utilize the system-provided encoding conversion functions(iconv / win).

@peter-jerry-ye
Copy link
Collaborator

Solely for your case, I think you don't even need the encoder or iconv / win since you should be accepting only asciis. A simple for loop should be enough for you.

@MINGtoMING
Copy link
Contributor Author

Right, in this func, it can be handled better as you suggested.

@MINGtoMING
Copy link
Contributor Author

However, in other cases, when the amount of data contained in StringView is large, converting it to Bytes(or other) before interaction may incur significant overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants