Skip to content

Disable trim_text in Deserializer from_reader #285

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
woodworker opened this issue Apr 17, 2021 · 24 comments · May be fixed by #561
Open

Disable trim_text in Deserializer from_reader #285

woodworker opened this issue Apr 17, 2021 · 24 comments · May be fixed by #561
Labels
enhancement help wanted serde Issues related to mapping from Rust types to XML

Comments

@woodworker
Copy link

Is there a easy way to set trim_text to false in the Deserializer::from_str when i use quick_xml::de::from_str?

quick-xml/src/de/mod.rs

Lines 160 to 167 in a4be484

pub fn from_reader(reader: R) -> Self {
let mut reader = Reader::from_reader(reader);
reader
.expand_empty_elements(true)
.check_end_names(true)
.trim_text(true);
Self::new(reader)
}

@ImJeremyHe
Copy link

Is there a way to determine whether set trim_text by looking if there is xml:space = "preserve"?
For example:

<t xml:space="preserve">Text </t>

The trailing space here should not be trimmed.

@tafia
Copy link
Owner

tafia commented May 12, 2021

There are very little customization on the serde deserializer so far. I don't think there is any major blocking point, someone just needs to write it.

@Mingun Mingun added enhancement help wanted serde Issues related to mapping from Rust types to XML labels May 21, 2022
@Mingun
Copy link
Collaborator

Mingun commented May 21, 2022

In the coming release Deserializer::new would be public and you could create a deserializer from a Reader (but do not turn off expand_empty_elements! For now Deserializer is not prepared for that).

Processing of xml:space still waits its own PR

@naumazeredo
Copy link

In the coming release Deserializer::new would be public

Was this implemented already? Deserializer::new is public in the latest version, but it seems useless since XmlRead can't be implemented outside of quick_xml and SliceReader and IoReader can't be instantiated also. There's no way to use Deserializer::new, unless I'm missing something here.

@Mingun
Copy link
Collaborator

Mingun commented Jul 28, 2022

Yes, this is oversight. So, currently this is still not possible, even in master. Need to think about better API. I also would to provide an API to create a deserializer for a part of XML, so you can mix usual Reader usage with the Deserializer usage, for example, to support streaming deserialization.

@Mingun
Copy link
Collaborator

Mingun commented Jul 28, 2022

According to the original use case -- I do not think that simply disabling trim_text would be usable -- it seems that you'll just break deserialization of pretty-printed XMLs at all with that setting

@naumazeredo
Copy link

Yeah, that's exactly what happened. I didn't get why that option, even though internal, exists.
I've sadly moved back to serde_xml_rs since they give an option to not trim and I'm not willing to spend more time debugging xml deserialization right now. I'll be trying quick_xml in the future in case it gets more versatility

@dralley
Copy link
Collaborator

dralley commented Jul 29, 2022

The trimming of spaces within elements probably ought to be separated from the trimming of spaces between elements. It should be possible (and probably the default) to ignore the latter without affecting the text contents of elements themselves.

Having an option for trimming spaces around text contents is nice of course, but not at all necessary (the user could easily do this themselves) and as this issue points out it is more difficult to do "correctly" than originally envisioned. Maybe we should eliminate this feature and just keep the "ignore spaces between XML elements" functionality?

@Mingun
Copy link
Collaborator

Mingun commented Jul 29, 2022

The trimming of spaces within elements probably ought to be separated from the trimming of spaces between elements.

Yes, I think, we should move in that direction. A couple of thoughts:

  • need to take into account a Deserializing tag with attribute values into Map #383 problem. I think, it can be solved by introducing a method to read all content as a string regardless of the XML markup inside:
    impl Reader {
      /// For XML
      ///
      /// <outer>  <inner/>   </outer>
      ///
      /// - can be called after BytesStart("outer")
      /// - returns "  <inner/>   "
      /// - consumes BytesEnd("outer")
      fn read_as_text(&mut self, end: QName) -> Result<Cow<str>> { ... }
    }
    // or maybe better (except for a long name :( )
    impl BytesStart {
      fn read_to_end_as_text(&self, reader: &mut Reader) -> Result<Cow<str>> { ... }
    }
  • we need a lookahead to decide if whitespace is significant or not (==determine the shape of the next tag -- opening/closing/self-closing/comment/PI/CData?) -- Is there any way to read an event and not consume it? #414 is related. Should we, for example, allow XML comments in whitespace-significant parts (<outer> <!----> </outer>)? If yes, that would require a potentially infinity lookahead

@Pastequee
Copy link

Hello, little update here, since Deserializer::new still is not public we can't have a from_str or any other variation where it does not trim the content. So I can work on a quick PR for that. But you already discussed about that, so have you a preferred solution ? I've thought about juste adding a from_reader implementation that takes a Reader (not something that implements IoReader) so that you can change the reader's attributes without having access to the Deserializer. This function will just force the attribute expand_empty_elements since you said it is required for the moment.

@Mingun
Copy link
Collaborator

Mingun commented Dec 29, 2022

Just disabling trim_text will not work correctly for pretty-printed XMLs, therefore I doubt that so limited implementation would be useful in mass. Actually, the trim_text* options should not exist at that level of parsing -- it is just wrong place to do trim. I'm working on proper trim implementation in #520 and I plan to implement it in 0.28 which is probably would be released 2-3 months later. After that probably this problem will gone (but maybe not).

Well, I think, that we could add a Deserializer::trim_text(trim: bool) method as a temporary solution, with a proper warning that it will not work for XMLs with pretty-printed parts.

@Mingun
Copy link
Collaborator

Mingun commented Mar 4, 2023

I've just created a #572. When it would be merged, we could change the content of an introduced Text type. We should change it definition to:

struct Text<'a> {
    /// Untrimmed text after concatenating content of all
    /// [`Text`] and [`CData`] events
    text: Cow<'a, str>,
    /// A range into `text` which contains data after trimming
    content: Range<usize>,
}

Such a change will open a door to use a per-field control for trimming

@nsunderland1
Copy link

Are you still looking at fixing this? If not, what remains to be done? This is a breaking issue for my team, and we may be interested in contributing in order to help fix it.

@Mingun
Copy link
Collaborator

Mingun commented Nov 8, 2023

I did not put my efforts in this issue since my last comment. Because #572 was merged, we can move forward by the way outlined in that comment. We also can add a way to globally disable trimming, but I think such setting will have a limited usefulness. If you wish feel free to explore those opportunities.

@mmcloughlin
Copy link

Is there any workaround for this problem short of a proper fix (#561 or other)?

My case is mixed content such as <tag>Hello <b>World</b>!</tag>. I am deserializing into a vector of externally tagged enum variants (#541), but I see that trailing whitespace in the Hello is lost.

Thanks for your work on this library! Any help appreciated.

@jamwil
Copy link

jamwil commented May 7, 2025

@mmcloughlin I started on #855 before getting sidetracked. No workaround as far as I can tell, and the fix is taking more hours than I have.

@mmcloughlin
Copy link

I started on #855 before getting sidetracked. No workaround as far as I can tell, and the fix is taking more hours than I have.

Thanks for the update!

I'd be willing to spend some time on a fix, but I'm not sure I understand exactly what the preferred resolution is? Looks like there's some nuances, and a multi-year discussion in this issue and others. Is there a description somewhere of what kind of PR would be accepted?

Meanwhile, the only (horrific) workaround I am considering is a pre-processor that inserts sentinels around certain tags to prevent trimming, which can be removed after deserialization. So for example <tag>Hello <b>World</b>!</tag> becomes <tag>|Hello |<b>|World|</b>|!|</tag> and the delimiters can be stripped in post-processing. Alternatively I could switch to another Rust XML library, but I'm already deep into integrating quick-xml before I discovered this problem.

@mmcloughlin
Copy link

Trying to gather the options under discussion:

  1. Implement whitespace behavior in the standard (Add trimming option for the de::from_* functions #561 (comment)), which says string primitive types should preserve whitespace, while all other primitives have collapse behavior.

    Mechanically, the proposal is to implement this by adding a content range to the Text type (as in Disable trim_text in Deserializer from_reader #285 (comment) and @Pastequee's b4355c6). Then, I think the various deserialize primitive methods would need to be updated to use the trimmed or untrimmed versions?

    Would this be a breaking change to quick-xml? I guess that's allowed by v0.

  2. Make the Deserializer configurable, specifically with a trim_text flag. This is @jamwil's approach in Add de::from_str_with_whitespace and de::from_reader_with_whitespace #855. What was the plan with this flag: would it be turned on/off globally, or would there be a per-field configuration as well? As mentioned by @Mingun, a global flag could be problematic for pretty-printed XML.

  3. Some kind of per-field configuration. This is non-trivial because serde doesn't allow extra attributes (Add trimming option for the de::from_* functions #561 (comment)). Could we introduce another special field name like $raw, $text_untrimmed, ...? I am not yet clear how easy this would be to implement.

  4. Other?

Which of these would be preferred?

@jamwil
Copy link

jamwil commented May 8, 2025

@mmcloughlin Thanks for distilling this. Personally, and despite having initially started on a different route, I like the idea in Option 1 of following the spec for string primitives if we're not to concern ourselves with backward compatibility.

In #855, I started down the path of having a mutable DeConfig struct owned by Deserializer, per this comment from @Mingun , but once I got into it more I felt I was going in circles a bit. I could be misunderstanding things (likely, I am—I'm fairly new to Rust), but there doesn't seem to be an opportunity or need to have a mutable configuration, since deserializing in this way is a declarative, one-shot operation. I also agree that a global flag is problematic.

@mmcloughlin
Copy link

if we're not to concern ourselves with backward compatibility

Feels like a situation where even if semantic versioning allows it, we'd need to be careful about Hyrum's Law.

@jamwil
Copy link

jamwil commented May 9, 2025

Very valid point. I think we do at least have @Mingun's blessing to explore adjusting the default based on this comment in #561 (emphasis my own):

While working on #574 I've found this: https://www.w3.org/TR/xmlschema11-2/#built-in-primitive-datatypes

This means, that Deserializer could itself determine if trimming is needed. Practically speaking, when deserialize with deserialize_borrowed_str / deserialize_str / deserialize_string we shouldn't trim, in all other cases should.

We could also add a flag "trim / not trim" to the Deserializer (which, actually, probably should be stored in ContentDeserializer), but I recommend first check if the behavior specified specified in the XML specification suits you. If yes, then implement it -- we shouldn't trim during reading, but only store a range with boundaries of not-trimmed content; apply trim when deserialize primitives except strings.

I'm no authority, but if the spec states that string primitives should preserve whitespace, and that simplifies our implementation using @Pastequee's work in b4355c6 as a base, then that approach has my vote. If compliance with the w3 spec is a stated or implied goal of quick-xml, then v0.x is our window to bring it into alignment.

@Mingun
Copy link
Collaborator

Mingun commented May 10, 2025

@mmcloughlin, thanks for summary. Yes, we definitely will implement option 1. @Pastequee catches the idea.

Then, I think the various deserialize primitive methods would need to be updated to use the trimmed or untrimmed versions?

Yes, and it seems that only the deserialize_str, deserialize_borrowed_str and deserialize_string should use untrimmed version, but need to check XML spec carefully. At the same time we could check the presence of the xml:space attribute like we did that for xsi:nil.

I'm not sure about option 2 (global trim_text), it seems will be unnecessary when other options will be implemented.

Option 3 also could be implemented to override default behavior from option 1, but maybe later if it will really be needed. At least this is purely non-breaking change which may be implemented in patch update.

If compliance with the w3 spec is a stated or implied goal of quick-xml, then v0.x is our window to bring it into alignment.

Yes, that is the goal (at least I want to have a mode where we will be fully compliance) and we will break compatibility if that will be needed, because v0.x is designed for that. After that we will release v1.

@jamwil
Copy link

jamwil commented May 10, 2025

Sounds like a plan. @mmcloughlin I'm going to close my wip PR. Feel free to put me to work in whatever way is most helpful.

@mmcloughlin
Copy link

Feel free to put me to work in whatever way is most helpful.

Haha, I'm not sure I can put you to work! Perhaps @Mingun is the one to listen to :)

According to @Mingun's last message, it seems there's consensus on trying Option 1: trimming for primitive types except strings. If you have time to work on that, that would be awesome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement help wanted serde Issues related to mapping from Rust types to XML
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants