Disable `trim_text` in Deserializer from_reader #285

woodworker · 2021-04-17T20:27:46Z

Is there an easy way to set trim_text to false in the Deserializer::from_str when I use quick_xml::de::from_str?

Lines 160 to 167 in a4be484

    
    pub fn from_reader(reader: R) -> Self {
        let mut reader = Reader::from_reader(reader);
        reader
            .expand_empty_elements(true)
            .check_end_names(true)
            .trim_text(true);
        Self::new(reader)
    }


ImJeremyHe · 2021-05-06T09:13:19Z

Is there a way to decide whether to apply trim_text by checking for xml:space="preserve"?
For example:

<t xml:space="preserve">Text </t>

The trailing space here should not be trimmed.
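One way to picture this is a small scope tracker, sketched below with the standard library only. `SpaceScope` and its methods are hypothetical names, not part of quick-xml. The sketch assumes that the "default" mode means trimming (mirroring the deserializer's current behavior) and that the attribute value is inherited by child elements until overridden.

```rust
/// Whitespace handling requested by an `xml:space` attribute.
#[derive(Clone, Copy, Debug, PartialEq)]
enum XmlSpace {
    Default,
    Preserve,
}

/// XML whitespace is limited to space, tab, CR and LF.
fn is_xml_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\r' | '\n')
}

/// Hypothetical helper (not a quick-xml type): tracks which `xml:space`
/// value is in effect while walking nested elements.
struct SpaceScope {
    stack: Vec<XmlSpace>,
}

impl SpaceScope {
    fn new() -> Self {
        SpaceScope { stack: Vec::new() }
    }

    /// Call when an element opens. `attr` is the raw value of its
    /// `xml:space` attribute, if it has one.
    fn enter(&mut self, attr: Option<&str>) {
        let effective = match attr {
            Some("preserve") => XmlSpace::Preserve,
            Some("default") => XmlSpace::Default,
            // Absent or unrecognized value: inherit from the parent element.
            _ => self.current(),
        };
        self.stack.push(effective);
    }

    /// Call when the matching end tag is reached.
    fn leave(&mut self) {
        self.stack.pop();
    }

    /// The mode in effect for the innermost open element.
    fn current(&self) -> XmlSpace {
        self.stack.last().copied().unwrap_or(XmlSpace::Default)
    }

    /// Returns the text as a consumer would see it in the current scope.
    /// Assumption: "default" mode trims, "preserve" mode leaves text as-is.
    fn apply<'a>(&self, text: &'a str) -> &'a str {
        match self.current() {
            XmlSpace::Preserve => text,
            XmlSpace::Default => text.trim_matches(is_xml_whitespace),
        }
    }
}
```

With this sketch, calling `enter(Some("preserve"))` for the `t` element above and then `apply("Text ")` would return the text with its trailing space intact.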

tafia · 2021-05-12T11:28:13Z

There is very little customization in the serde deserializer so far. I don't think there is any major blocking point; someone just needs to write it.

Mingun · 2022-05-21T20:23:47Z

In the coming release Deserializer::new would be public and you could create a deserializer from a Reader (but do not turn off expand_empty_elements! For now Deserializer is not prepared for that).

Processing of xml:space still awaits its own PR.

naumazeredo · 2022-07-28T05:01:58Z

In the coming release Deserializer::new would be public

Was this implemented already? Deserializer::new is public in the latest version, but it seems useless since XmlRead can't be implemented outside of quick_xml, and SliceReader and IoReader can't be instantiated either. There's no way to use Deserializer::new, unless I'm missing something here.

Mingun · 2022-07-28T05:35:54Z

Yes, this is an oversight. So currently this is still not possible, even in master. We need to think about a better API. I would also like to provide an API to create a deserializer for a part of the XML, so you can mix the usual Reader usage with the Deserializer usage, for example, to support streaming deserialization.

Mingun · 2022-07-28T05:43:27Z

Regarding the original use case -- I do not think that simply disabling trim_text would be usable -- it seems that with that setting you would break deserialization of pretty-printed XML entirely.

naumazeredo · 2022-07-29T05:43:42Z

Yeah, that's exactly what happened. I didn't get why that option, even though internal, exists.
I've sadly moved back to serde_xml_rs since they give an option to not trim and I'm not willing to spend more time debugging xml deserialization right now. I'll be trying quick_xml in the future in case it gets more versatility

dralley · 2022-07-29T13:37:33Z

The trimming of spaces within elements probably ought to be separated from the trimming of spaces between elements. It should be possible (and probably the default) to ignore the latter without affecting the text contents of elements themselves.

Having an option for trimming spaces around text contents is nice of course, but not at all necessary (the user could easily do this themselves) and as this issue points out it is more difficult to do "correctly" than originally envisioned. Maybe we should eliminate this feature and just keep the "ignore spaces between XML elements" functionality?

Mingun · 2022-07-29T17:01:42Z

The trimming of spaces within elements probably ought to be separated from the trimming of spaces between elements.

Yes, I think, we should move in that direction. A couple of thoughts:

We need to take into account the problem from "Deserializing tag with attribute values into Map" (#383). I think it can be solved by introducing a method to read all content as a string regardless of the XML markup inside:

impl Reader {
  /// For XML
  ///
  /// <outer>  <inner/>   </outer>
  ///
  /// - can be called after BytesStart("outer")
  /// - returns "  <inner/>   "
  /// - consumes BytesEnd("outer")
  fn read_as_text(&mut self, end: QName) -> Result<Cow<str>> { ... }
}
// or maybe better (except for a long name :( )
impl BytesStart {
  fn read_to_end_as_text(&self, reader: &mut Reader) -> Result<Cow<str>> { ... }
}

We need a lookahead to decide whether whitespace is significant or not (that is, to determine the shape of the next tag: opening/closing/self-closing/comment/PI/CData?). "Is there any way to read an event and not consume it?" (#414) is related. Should we, for example, allow XML comments in whitespace-significant parts (<outer>  </outer>)? If yes, that would require a potentially infinite lookahead.
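The lookahead idea can be sketched on a simplified event stream. The `Event` enum and `skip_insignificant_whitespace` below are hypothetical stand-ins, not quick-xml types. The assumed rule is that whitespace-only text is kept only when it is the entire content of an element (directly between a start tag and an end tag) and is dropped when it merely separates elements. Comments and other event kinds are left out, so a single-event peek is enough here.

```rust
/// Simplified stand-in for a parser event (not the quick-xml `Event` type).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
}

/// True if the string consists only of XML whitespace (space, tab, CR, LF).
fn is_xml_whitespace(s: &str) -> bool {
    s.chars().all(|c| matches!(c, ' ' | '\t' | '\r' | '\n'))
}

/// Drops whitespace-only text that sits between elements (indentation of
/// pretty-printed XML) and keeps whitespace-only text that is the whole
/// content of an element. Text with any other character is never touched.
fn skip_insignificant_whitespace(events: Vec<Event>) -> Vec<Event> {
    let mut out = Vec::with_capacity(events.len());
    let mut iter = events.into_iter().peekable();
    while let Some(event) = iter.next() {
        if let Event::Text(text) = &event {
            if is_xml_whitespace(text) {
                // Look back at the last kept event and ahead at the next one.
                let after_start = matches!(out.last(), Some(Event::Start(_)));
                let before_end = matches!(iter.peek(), Some(Event::End(_)));
                if !(after_start && before_end) {
                    continue; // indentation between elements
                }
            }
        }
        out.push(event);
    }
    out
}
```

Under these assumptions the indentation around child elements disappears, while the text content of each element, including its leading and trailing spaces, reaches the consumer unchanged.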

Pastequee · 2022-12-29T10:14:32Z

Hello, a little update here: since Deserializer::new still is not public, we can't have a from_str or any other variation that does not trim the content. So I can work on a quick PR for that. But you already discussed this, so do you have a preferred solution? I've thought about just adding a from_reader implementation that takes a Reader (not something that implements IoReader) so that you can change the reader's attributes without having access to the Deserializer. This function would just force the expand_empty_elements attribute, since you said it is required for the moment.

Mingun · 2022-12-29T15:35:53Z

Just disabling trim_text will not work correctly for pretty-printed XML, so I doubt that such a limited implementation would be widely useful. Actually, the trim_text* options should not exist at that level of parsing -- it is just the wrong place to trim. I'm working on a proper trim implementation in #520 and I plan to implement it in 0.28, which will probably be released in 2-3 months. After that this problem will probably be gone (but maybe not).

Well, I think we could add a Deserializer::trim_text(trim: bool) method as a temporary solution, with a proper warning that it will not work for XML with pretty-printed parts.

Mingun · 2023-03-04T18:37:37Z

I've just created #572. Once it is merged, we could change the content of the Text type it introduces. We should change its definition to:

use std::borrow::Cow;
use std::ops::Range;

struct Text<'a> {
    /// Untrimmed text after concatenating content of all
    /// [`Text`] and [`CData`] events
    text: Cow<'a, str>,
    /// A range into `text` which contains data after trimming
    content: Range<usize>,
}

Such a change would open the door to per-field control of trimming.
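A minimal sketch of how such a type could be built and queried, using only the standard library. The constructor and the two accessor names are assumptions made for illustration; only the two fields come from the definition above.

```rust
use std::borrow::Cow;
use std::ops::Range;

/// XML whitespace is limited to space, tab, CR and LF.
fn is_xml_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\r' | '\n')
}

struct Text<'a> {
    /// Untrimmed text.
    text: Cow<'a, str>,
    /// A range into `text` which contains data after trimming.
    content: Range<usize>,
}

impl<'a> Text<'a> {
    /// Hypothetical constructor: stores the text untouched and records
    /// where the trimmed content starts and ends.
    fn new(text: Cow<'a, str>) -> Self {
        let start = text.len() - text.trim_start_matches(is_xml_whitespace).len();
        let end = text.trim_end_matches(is_xml_whitespace).len();
        // For whitespace-only text `end` would be 0, so clamp it to `start`
        // to get an empty range instead of an inverted one.
        let end = end.max(start);
        Text { text, content: start..end }
    }

    /// The text exactly as it appeared in the document.
    fn untrimmed(&self) -> &str {
        &self.text
    }

    /// The text without leading and trailing XML whitespace.
    fn trimmed(&self) -> &str {
        &self.text[self.content.clone()]
    }
}
```

Because nothing is discarded at read time, the decision to trim can be deferred to whoever consumes the value, which is what would make per-field control possible.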

nsunderland1 · 2023-11-08T01:47:50Z

Are you still looking at fixing this? If not, what remains to be done? This is a breaking issue for my team, and we may be interested in contributing in order to help fix it.

Mingun · 2023-11-08T16:45:58Z

I have not put any effort into this issue since my last comment. Because #572 was merged, we can move forward in the way outlined in that comment. We can also add a way to globally disable trimming, but I think such a setting would have limited usefulness. If you wish, feel free to explore those opportunities.

mmcloughlin · 2025-05-07T04:14:41Z

Is there any workaround for this problem short of a proper fix (#561 or other)?

My case is mixed content such as <tag>Hello World!</tag>. I am deserializing into a vector of externally tagged enum variants (#541), but I see that trailing whitespace in the Hello is lost.

Thanks for your work on this library! Any help appreciated.

jamwil · 2025-05-07T05:53:19Z

@mmcloughlin I started on #855 before getting sidetracked. No workaround as far as I can tell, and the fix is taking more hours than I have.

mmcloughlin · 2025-05-07T18:15:34Z

I started on #855 before getting sidetracked. No workaround as far as I can tell, and the fix is taking more hours than I have.

Thanks for the update!

I'd be willing to spend some time on a fix, but I'm not sure I understand exactly what the preferred resolution is. It looks like there are some nuances, and a multi-year discussion in this issue and others. Is there a description somewhere of what kind of PR would be accepted?

Meanwhile, the only (horrific) workaround I am considering is a pre-processor that inserts sentinels around certain tags to prevent trimming, which can be removed after deserialization. So for example <tag>Hello World!</tag> becomes <tag>|Hello ||World||!|</tag> and the delimiters can be stripped in post-processing. Alternatively I could switch to another Rust XML library, but I'm already deep into integrating quick-xml before I discovered this problem.
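The sentinel idea can be sketched for a single text segment as follows. The helper names `protect` and `unprotect` are hypothetical, and the sketch assumes the chosen sentinel character never appears at the edges of real data. Locating the text segments inside the XML is left out.

```rust
/// Sentinel character placed around text segments. Assumption: it never
/// occurs at the start or end of genuine text in the document.
const SENTINEL: char = '|';

/// Pre-processing step: wrap a text segment so that its leading and trailing
/// whitespace is no longer at the edge of the text and survives trimming.
fn protect(segment: &str) -> String {
    format!("{}{}{}", SENTINEL, segment, SENTINEL)
}

/// Post-processing step: remove at most one sentinel from each end of a
/// deserialized value. Values without sentinels are returned unchanged.
fn unprotect(value: &str) -> &str {
    let value = value.strip_prefix(SENTINEL).unwrap_or(value);
    value.strip_suffix(SENTINEL).unwrap_or(value)
}
```

The round trip `unprotect(protect("Hello ").trim())` gives back `"Hello "` with its trailing space, because the trim only sees the sentinels at the edges.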

mmcloughlin · 2025-05-08T16:10:37Z

Trying to gather the options under discussion:

1. Implement the whitespace behavior in the standard (Add trimming option for the de::from_* functions #561 (comment)), which says string primitive types should preserve whitespace, while all other primitives have collapse behavior.

   Mechanically, the proposal is to implement this by adding a content range to the Text type (as in Disable trim_text in Deserializer from_reader #285 (comment) and @Pastequee's b4355c6). Then, I think the various deserialize primitive methods would need to be updated to use the trimmed or untrimmed versions?

   Would this be a breaking change to quick-xml? I guess that's allowed by v0.
2. Make the Deserializer configurable, specifically with a trim_text flag. This is @jamwil's approach in Add de::from_str_with_whitespace and de::from_reader_with_whitespace #855. What was the plan with this flag: would it be turned on/off globally, or would there be a per-field configuration as well? As mentioned by @Mingun, a global flag could be problematic for pretty-printed XML.
3. Some kind of per-field configuration. This is non-trivial because serde doesn't allow extra attributes (Add trimming option for the de::from_* functions #561 (comment)). Could we introduce another special field name like $raw, $text_untrimmed, ...? I am not yet clear how easy this would be to implement.
4. Other?

Which of these would be preferred?

jamwil · 2025-05-08T18:01:39Z

@mmcloughlin Thanks for distilling this. Personally, and despite having initially started on a different route, I like the idea in Option 1 of following the spec for string primitives if we're not to concern ourselves with backward compatibility.

In #855, I started down the path of having a mutable DeConfig struct owned by Deserializer, per this comment from @Mingun , but once I got into it more I felt I was going in circles a bit. I could be misunderstanding things (likely, I am—I'm fairly new to Rust), but there doesn't seem to be an opportunity or need to have a mutable configuration, since deserializing in this way is a declarative, one-shot operation. I also agree that a global flag is problematic.

mmcloughlin · 2025-05-08T19:46:48Z

if we're not to concern ourselves with backward compatibility

Feels like a situation where even if semantic versioning allows it, we'd need to be careful about Hyrum's Law.

jamwil · 2025-05-09T17:17:47Z

Very valid point. I think we do at least have @Mingun's blessing to explore adjusting the default based on this comment in #561 (emphasis my own):

While working on #574 I've found this: https://www.w3.org/TR/xmlschema11-2/#built-in-primitive-datatypes

This means that the Deserializer could itself determine if trimming is needed. Practically speaking, when deserializing with deserialize_borrowed_str / deserialize_str / deserialize_string we shouldn't trim; in all other cases we should.

We could also add a "trim / not trim" flag to the Deserializer (which, actually, should probably be stored in ContentDeserializer), but I recommend first checking whether the behavior specified in the XML specification suits you. If yes, then implement it -- we shouldn't trim during reading, but only store a range with boundaries of not-trimmed content, and apply the trim when deserializing primitives other than strings.

I'm no authority, but if the spec states that string primitives should preserve whitespace, and that simplifies our implementation using @Pastequee's work in b4355c6 as a base, then that approach has my vote. If compliance with the w3 spec is a stated or implied goal of quick-xml, then v0.x is our window to bring it into alignment.
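To make the proposed rule concrete, here is a minimal standalone sketch. `RawText` and its methods are hypothetical and are not the quick-xml deserializer. String access returns the text untouched, while every other primitive is parsed from the text with its edge whitespace removed. The sketch only strips the edges and does not implement the full XML Schema "collapse" rule, which also normalizes internal whitespace.

```rust
use std::str::FromStr;

/// XML whitespace is limited to space, tab, CR and LF.
fn is_xml_whitespace(c: char) -> bool {
    matches!(c, ' ' | '\t' | '\r' | '\n')
}

/// Hypothetical stand-in for the text content captured from an element.
struct RawText<'a>(&'a str);

impl<'a> RawText<'a> {
    /// String-like targets see the text exactly as written.
    fn as_str(&self) -> &'a str {
        self.0
    }

    /// Non-string primitives (integers, floats, booleans, ...) are parsed
    /// from the text with leading and trailing whitespace removed.
    fn parse<T: FromStr>(&self) -> Result<T, T::Err> {
        self.0.trim_matches(is_xml_whitespace).parse()
    }
}
```

For example, `RawText(" 42\n").parse::<i32>()` succeeds, whereas parsing the untrimmed text directly would fail, and `as_str()` still returns `" 42\n"` for a field typed as a string.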

Mingun · 2025-05-10T12:51:38Z

@mmcloughlin, thanks for the summary. Yes, we will definitely implement option 1. @Pastequee caught the idea.

Then, I think the various deserialize primitive methods would need to be updated to use the trimmed or untrimmed versions?

Yes, and it seems that only deserialize_str, deserialize_borrowed_str and deserialize_string should use the untrimmed version, but we need to check the XML spec carefully. At the same time we could check for the presence of the xml:space attribute, like we did for xsi:nil.

I'm not sure about option 2 (global trim_text); it seems it will be unnecessary once the other options are implemented.

Option 3 could also be implemented to override the default behavior from option 1, but maybe later, if it is really needed. At least it is a purely non-breaking change that may be implemented in a patch update.

If compliance with the w3 spec is a stated or implied goal of quick-xml, then v0.x is our window to bring it into alignment.

Yes, that is the goal (at least I want to have a mode where we are fully compliant), and we will break compatibility if needed, because v0.x is designed for that. After that we will release v1.

jamwil · 2025-05-10T15:56:02Z

Sounds like a plan. @mmcloughlin I'm going to close my wip PR. Feel free to put me to work in whatever way is most helpful.

mmcloughlin · 2025-05-10T16:49:16Z

Feel free to put me to work in whatever way is most helpful.

Haha, I'm not sure I can put you to work! Perhaps @Mingun is the one to listen to :)

According to @Mingun's last message, it seems there's consensus on trying Option 1: trimming for primitive types except strings. If you have time to work on that, that would be awesome.

Mingun added the enhancement, help wanted, and serde (issues related to mapping from Rust types to XML) labels on May 21, 2022

This was referenced Aug 17, 2022

Strip BOM from the event stream, add a method to Writer for writing BOM #459

Merged

There should be a configuration option for skipping non-meaningful text events in-between XML tags #461

Open

Mingun mentioned this issue Aug 25, 2022

Recognize and process some special XML attributes #464

Open

3 tasks

Mingun linked a pull request Mar 4, 2023 that will close this issue

Add trimming option for the de::from_* functions #561

Draft

Pastequee linked a pull request Mar 8, 2023 that will close this issue

Add trimming option for the de::from_* functions #561

Draft

r-glazkov mentioned this issue Sep 16, 2023

Avoid trimming elements with the text content r-glazkov/fb2#2

Open

dralley mentioned this issue Nov 29, 2023

New parser implementation #690

Draft

5 tasks

Mingun mentioned this issue Aug 9, 2024

How to deserialize an element contains empty string? #795

Closed
