Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

attr_list: add option to preserve attributes without value as-is #1501

Open
FeldrinH opened this issue Jan 16, 2025 · 17 comments
Open

attr_list: add option to preserve attributes without value as-is #1501

FeldrinH opened this issue Jan 16, 2025 · 17 comments
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions. someday-maybe Approved low priority request.

Comments

@FeldrinH
Copy link

Currently, if you have an attribute without value then attr_list defaults the value to the name of the attribute. For example:

[download](file.txt){download}

is rendered as

<a download="download" href="file.txt">download</a>

I would like to have an option to preserve it as simply download without value, so the output would be this:

<a download href="file.txt">download</a>

This would make the attribute list in markdown more consistent with the produced HTML.

PS: What was the reasoning behind the current behavior? I can't think of a scenario where I would want the attribute to have its name as the value.

@facelessuser
Copy link
Collaborator

IIRC it is simply how etree works. I don't Python Markdown especially cares about how it is output, they are just limited by the behavior of etree.

Attributes are defined with a key and value. You have to define something for the value:

>>> tag = etree.Element('div', {'download': ''})
>>> etree.tostring(tag)
b'<div download="" />'
>>> tag = etree.Element('div', {'download': 'download'})
>>> etree.tostring(tag)
b'<div download="download" />'

What's more correct? What would you suggest?

The only thing I can think of is having some sort of post processer to strip out values if desired, but then you have to know how to identify with the value is meant to be stripped or not meant to be stripped.

Anyway, I don't know that this is likely to change unless the document handler was changed and it supported a way to do what you want, and I doubt Python Markdown will be doing such a massive change like replacing the HTML handler at this point.

@waylan
Copy link
Member

waylan commented Jan 16, 2025

This is dependent on the output_format. Python-Markdown defaults to outputting XHTML for historical reasons and an attribute without a value is not valid XHTML. However, if you change the output_format to html then you get the requested behavior.

>>> markdown.markdown('[download](file.txt){download}', extensions=['attr_list'])
'<p><a download="download" href="file.txt">download</a></p>'
>>> markdown.markdown('[download](file.txt){download}', extensions=['attr_list'], output_format='html')
'<p><a download href="file.txt">download</a></p>'

There is an argument to be made that we could change the default here. However, we have avoided that because XHTML is still valid HTML5, but HTML5 is not always valid XHTML. If any users are still using this lib to generate strict XHML pages, then a change to the default would break their pages. However, leaving the default as-is breaks nothing while allowing those users who care to change the output to match their desired format.

@waylan
Copy link
Member

waylan commented Jan 16, 2025

@facelessuser as a reminder, we use our own custom serializer, not etree's built-in one. And that allows us to address things like this.

if k == v and format == 'html':
# handle boolean attributes
write(" %s" % v)
else:
write(' {}="{}"'.format(k, v))

@waylan waylan added the support Support request. label Jan 16, 2025
@waylan waylan closed this as not planned Won't fix, can't repro, duplicate, stale Jan 16, 2025
@facelessuser
Copy link
Collaborator

@facelessuser as a reminder, we use our own custom serializer, not etree's built-in one. And that allows us to address things like this.

Oh, right, I forgot about that. I'd say in modern day, XHMTL is used nearly as much anymore, but I get the desire to not change the status quo.

@FeldrinH
Copy link
Author

FeldrinH commented Jan 16, 2025

This is dependent on the output_format. Python-Markdown defaults to outputting XHTML for historical reasons and an attribute without a value is not valid XHTML. However, if you change the output_format to html then you get the requested behavior.

>>> markdown.markdown('[download](file.txt){download}', extensions=['attr_list'])
'<p><a download="download" href="file.txt">download</a></p>'
>>> markdown.markdown('[download](file.txt){download}', extensions=['attr_list'], output_format='html')
'<p><a download href="file.txt">download</a></p>'

This inconsistency seems quite problematic. Modern browsers treat download="" and download as equivalent (AFAIK this is generally true for any HTML attribute without an explicit value), but download="download" is treated differently (it gives the downloaded file a default name of "download"). So the current behavior creates a situation where the behavior of the link changes when changing the output format from XHTML to HTML (again, AFAIK this is generally true for any attr_list attribute without an explicit value).

I argue that for consistency the default value of an attribute without value in XHTML should be the blank string (""), not the name of the attribute.

@waylan
Copy link
Member

waylan commented Jan 17, 2025

Modern browsers treat download="" and download as equivalent (AFAIK this is generally true for any HTML attribute without an explicit value), but download="download" is treated differently (it gives the downloaded file a default name of "download").

@FeldrinH can you point to where in the XHTML spec this behavior is specified? Or is this a situation where the browser acts correctly only if strict XHTML is used (proper headers and doc type forces browser to recognize a page as strict XHTML such that the page fails to render if the XHTML is not valid XML)? And is this the behavior for any key-only attribute or only the download attribute?

@FeldrinH
Copy link
Author

FeldrinH commented Jan 17, 2025

I specifically meant that modern browsers treat download as equivalent to download="" in HTML. It is specified here in the spec for all attributes:
Image
https://html.spec.whatwg.org/#attributes-2S

For XHTML attributes without values are not allowed, so my proposal is that download and other attributes without values should be output as download="" in XHTML, to match the behavior of modern HTML.

@waylan
Copy link
Member

waylan commented Jan 21, 2025

I understood you the first time. However, you are pointing at the HTML spec. If we are going to change the output for XHTML (as you are suggesting), then I want to see where the XHTML spec supports the proposed change. However, if you want valid HTML, then you should be specifying html at the output format and you will get valid HTML with no changes to the code being necessary.

@FeldrinH
Copy link
Author

FeldrinH commented Jan 21, 2025

I don't quite understand what you are asking for. I am certain even without reading the spec that XHTML supports download="" and I don't see what else in the XHTML spec would be relevant.

@waylan
Copy link
Member

waylan commented Jan 27, 2025

So far you have not demonstrated that our output is invalid. If you can point to where the XHTML spec indicates that download="download" is invalid XHTML and download="" is valid XHTML, then we will make a change. Otherwise, we intend to leave the behavior as-is.

@FeldrinH
Copy link
Author

Both are valid. My problem is that currently switching from XHTML output to HTML output changes the meaning of the generated HTML/XHTML document.

@FeldrinH
Copy link
Author

[download](file.txt){download} is rendered as <a download="download" href="file.txt">download</a> in XHTML, but <a download href="file.txt">download</a> in HTML. These produce different behavior in the browser. My proposal would correct this inconsistency.

@waylan
Copy link
Member

waylan commented Jan 28, 2025

These produce different behavior in the browser.

Browsers only recognize XHTML as XHTML if the proper doc type is defined. Have you tested this properly? I don't know because you keep making assertions with nothing to back them up.

@FeldrinH
Copy link
Author

Browsers only recognize XHTML as XHTML if the proper doc type is defined.

AFAIK in modern browsers XHTML and HTML just have syntax and validation differences. (I couldn't find an explicit source for this, but any sources I've found listing the differences between HTML and XHTML only mention extra syntax restrictions and validation in XHTML.) The meaning of an attribute value shouldn't change regardless of what the browser recognizes the document as. So download="download" will have the same behavior in HTML and XHTML.

Have you tested this properly? I don't know because you keep making assertions with nothing to back them up.

I did test that download="download" and download behave differently in modern Chrome and Firefox, whereas download="" and download behave the same. The test results did not change with any doctype.

Here is my test setup: xhtmltest.zip. Serve it with any static file server and try for yourself. In my tests download and download="" result in the file being named testfile.txt, whereas download="download" results in the file being named download.txt or just download.

@waylan
Copy link
Member

waylan commented Feb 21, 2025

I actually had some time to look at this today. The issue is that download is not strictly a boolean attribute (it is only boolean when no value is assigned, but not when a value is assigned). Additionally, AFAICT it was only introduced in HTML5. It is not mentioned at all in the HTML4 or XHTML specs. In other words, download="download" is not valid XHTML, but neither is download="". However, both checked="checked" and checked="" are valid and behave the same. So the question is: how do we handle this?

I note that the XHTML spec lists all known boolean attributes (compact, nowrap, ismap, declare, noshade, checked, disabled, readonly, multiple, selected, noresize, defer). We could change behavior based on this list when the output_format is xhtml. In other words, we would recognize that any attribute not on this list is not a boolean in XHTML mode and output differently (such as download="").

Why not just change all booleans to key=""? Because, existing users with existing documents could be relying on the existing behavior in various ways. We don't want to fix what is not broken. It is only broken for non-boolean attributes as defined by the XHTML spec.

There is one additional complication. When we are building the tree in our parser, we immediately assign key="key" regardless of whether the original source had key, key="" or key="key". And from that point on, any code which looks for boolean attributes relies on the key and value being equal. The seemingly obvious approach would be to better preserve the source. However, that would break any existing boolean attribute detection in third-party extensions. In order not to break those, we can only make changes at the render stage. And that means that if someone actually wants download="download", the suggested change would make that impossible.

In other words, this is much more complicated than it may seem at first.

In the short term, I will say that using download as an attribute in an XHTML document (in any format) is not strictly valid. Therefore, we can't guarantee that it will behave as desired. In other words, don't use the download attribute when output_format is xhtml.

@FeldrinH
Copy link
Author

FeldrinH commented Mar 1, 2025

And that means that if someone actually wants download="download", the suggested change would make that impossible.

I think it's worth noting that this issue already exists for the HTML output format:

>>> markdown.markdown('[download](file.txt){download="download"}', extensions=['attr_list'], output_format='html')
'<p><a download href="file.txt">download</a></p>'

@waylan
Copy link
Member

waylan commented Mar 12, 2025

An argument could be made that a user setting {download="download"} for any output format, but if that is the case, then the expectation would typically be to preserve what the user provided. However, there is no guarantee that that will happen. Bad input results in bad output. Ultimately, that is what is going on here and so I am not particularly motivated to address this.

That said, if we were going to do anything, I think the only reasonable solution would be to incorporate a list of known boolean attributes (which may be different depending on output format but would not include download for either format). Then, when parsing, we maintain our current behavior only for those known attributes. For any other attributes, we introduce a new behavior of preserving whatever the user provided. In that way we can maintain backward compatibility for the existing boolean attributes while also avoiding introducing invalid markup for any non-boolean attribute. Of course, if the document author introduces invalid markup (uses a non-boolean attribute as a boolean), then we will pass that through unaltered, which is the standard policy for any user provided markup.

So how would we implement that? After all, internally we have no way to represent key without any value assigned to it. I suggest representing key as key = None (or maybe key = True) internally. Then the serializer would render that as key as it is not a known boolean attribute.

@waylan waylan reopened this Mar 12, 2025
@waylan waylan added bug Bug report. someday-maybe Approved low priority request. extension Related to one or more of the included extensions. confirmed Confirmed bug report or approved feature request. and removed support Support request. labels Mar 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug report. confirmed Confirmed bug report or approved feature request. extension Related to one or more of the included extensions. someday-maybe Approved low priority request.
Projects
None yet
Development

No branches or pull requests

3 participants