newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>') #171

Open
posita opened this issue Sep 26, 2015 · 9 comments

Comments

@posita
Contributor

posita commented Sep 26, 2015

On Python 3.4:

>>> from __future__ import print_function, unicode_literals ; from builtins import *
>>> bytes
<class 'bytes'>
>>> str
<class 'str'>
>>> b1 = str(u'abc \u0123 do re mi').encode(u'utf_8') # this works
>>> b1
b'abc \xc4\xa3 do re mi'
>>> b2 = bytes(u'abc \u0123 do re mi', u'utf_8') # so does this
>>> b2
b'abc \xc4\xa3 do re mi'
>>> b1 == b2
True
>>> b3 = bytes(str(u'abc \u0123 do re mi'), u'utf_8') # this works too (unsurprisingly)
>>> b3
b'abc \xc4\xa3 do re mi'
>>> b1 == b3
True

On Python 2.7:

>>> from __future__ import print_function, unicode_literals ; from builtins import *
>>> bytes
<class 'future.types.newbytes.newbytes'>
>>> str
<class 'future.types.newstr.newstr'>
>>> b1 = str(u'abc \u0123 do re mi').encode(u'utf_8') # this works
>>> b1
b'abc \xc4\xa3 do re mi'
>>> type(b1)
<class 'future.types.newbytes.newbytes'>
>>> b2 = bytes(u'abc \u0123 do re mi', u'utf_8') # so does this (argument is native unicode object)
>>> b2
b'abc \xc4\xa3 do re mi'
>>> b1 == b2
True
>>> b3 = bytes(str(u'abc \u0123 do re mi'), u'utf_8') # but this looks like it's encoding the repr() of the newstr
>>> b3
b"b'abc \xc4\xa3 do re mi'"
>>> b1 == b3
False
>>> # I can't figure out what it's actually doing though; these aren't quite the same
>>> bytes(repr(str(u'abc \u0123 do re mi')).encode(u'utf_8'))
b"'abc \\u0123 do re mi'"
>>> bytes(repr(str(u'abc \u0123 do re mi').encode(u'utf_8')), 'utf_8')
b"b'abc \\xc4\\xa3 do re mi'"
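For reference, the three construction paths compared above can be folded into a single check. This is only a sketch: the helper name `encodings_agree` is hypothetical, and it assumes that `from builtins import bytes, str` resolves to the future backports on Python 2 and to the native types on Python 3.

```python
from builtins import bytes, str  # backports on Python 2 (with future), native types on Python 3

TEXT = u'abc \u0123 do re mi'


def encodings_agree(text, encoding=u'utf_8'):
    """Return True if all three ways of encoding `text` give equal bytes.

    Hypothetical helper, not part of future.
    """
    via_encode = str(text).encode(encoding)        # str(...).encode(...)
    via_ctor_plain = bytes(text, encoding)         # bytes(<plain unicode>, enc)
    via_ctor_wrapped = bytes(str(text), encoding)  # bytes(str(...), enc)
    return via_encode == via_ctor_plain == via_ctor_wrapped


print(encodings_agree(TEXT))
```

On Python 3 this prints True. Going by the Python 2.7 transcript above, the third path is the one that would make it print False on an affected install.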
@posita posita changed the title newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces newbytes(repr(newstr(...)), '<encoding>') newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>') Sep 26, 2015
@edschofield
Contributor

edschofield commented Sep 27, 2015 via email

@posita
Contributor Author

posita commented Sep 27, 2015

The good news is that there's an easy workaround (modified from above):

from builtins import *
s = str(...) # make a newstr on Python 2
# Instead of bytes(s, u'utf_8'), do:
s.encode(u'utf_8') # will return newbytes object on Python 2

So this issue is really about interface compatibility rather than about available functionality. My guess is that most people use the str(...).encode(<encoding>) method as opposed to the bytes(..., <encoding>) method, which may be why this hasn't been discovered yet?

@posita
Contributor Author

posita commented Oct 1, 2015

I think I see what is happening here. (The following code snippets are all from Python 2, in case that wasn't obvious.) Consider:

>>> from future.types.newstr import newstr
>>> isinstance(newstr(u'asdf'), unicode)
True

From future/types/newbytes.py at line 93:

class newbytes(with_metaclass(BaseNewBytes, _builtin_bytes)):
    ...
    def __new__(cls, *args, **kwargs):
        ... # gets `encoding` from *args, **kwargs
        elif isinstance(args[0], unicode):
            ...
            newargs = [encoding]
            ...
            value = args[0].encode(*newargs)
            ...
        return super(newbytes, cls).__new__(cls, value)

Which, in the case of newbytes(newstr(u'asdf'), u'utf_8') is basically:

value = <newstr>.encode(u'utf_8') # returns instance of <class 'future.types.newbytes.newbytes'>
value = super(<newbytes>, cls).__new__(cls, value) # but see below

So newbytes's parent constructor (i.e., from builtin bytes) is being called with a newbytes instance as its argument. We'll see why this is a problem below, but first consider:

>>> nativebytes = bytes ; nativestr = str ; from builtins import *
>>> bytes(bytes(b'asdf'))
b'asdf'
>>> bytes(nativebytes(b'asdf'))
b'asdf'
>>> nativebytes(bytes(b'asdf'))
"b'asdf'" # whoops!

From future/types/newbytes.py at line 120:

def __str__(self):
    return 'b' + "'{0}'".format(super(newbytes, self).__str__())

This behavior mirrors Python 3, so it's correct. However, the native bytes constructor doesn't know how to deal with a newbytes argument, so it calls the argument's __str__() method to figure out how to populate itself. What's really happening is something like this:

<newbytes> = <newstr>.encode(u'utf_8') # returns instance of <class 'future.types.newbytes.newbytes'>
value = super(<newbytes>, cls).__new__(cls, <newbytes>.__str__())

I'm not quite sure what the correct fix is if one wants to safely allow for the ability to derive subclasses from both newbytes and newstr.
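The mechanism described above can be reproduced without future at all. The sketch below is an analogy only: `ReprLikeStr` and `copy_via_base_constructor` are hypothetical names, and a `str` subclass on Python 3 stands in for newbytes on top of Python 2's native `str`. The point it illustrates is that the base constructor asks a subclass instance for its `__str__()` rather than copying its stored payload.

```python
class ReprLikeStr(str):
    """Hypothetical stand-in for newbytes.

    The payload lives in the base type, but __str__ returns a decorated,
    repr-like form (as newbytes.__str__ does to mirror Python 3).
    """

    def __str__(self):
        # str.__str__(self) yields the plain payload; wrap it as b'...'
        return "b" + repr(str.__str__(self))


def copy_via_base_constructor(value):
    # The base constructor does not copy a subclass instance's payload;
    # it calls the instance's __str__() and uses whatever comes back.
    return str(value)


wrapped = ReprLikeStr("asdf")
copied = copy_via_base_constructor(wrapped)

print(len(wrapped))  # length of the stored payload
print(copied)        # the decorated text, not the payload
```

The stored payload is still the four characters `asdf`, yet the copy made through the base constructor is the decorated text `b'asdf'`, which is the same shape as the `nativebytes(bytes(b'asdf'))` result shown earlier.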

@posita posita changed the title newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>') FIXED IN #173 - newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>') Nov 21, 2015
@edschofield
Contributor

edschofield commented May 22, 2016

Matt, thanks for filing this issue and your pull request!

The tests that this issue mentions actually seem to be passing on the v0.15.x branch for me. Could you please confirm whether this is true in your testing too? I'm wondering whether I still need to merge in your PR #173.

@posita
Contributor Author

posita commented Jun 14, 2016

@edschofield, my apologies for the delay. After some investigation, it looks like #193 is a duplicate of this issue. In response to your question, I no longer see the behavior addressed by #173 after abf19bb. 👍

I'll close #173, but please bear in mind that some of the errant (or at least confusing) behavior still exists. From #173 (comment):

[W]ithout monkey-patching Python 2's native str's constructor, I do not know how to handle this case:

Python 2.7.10 (default, Sep 24 2015, 10:13:45)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> nativebytes = bytes ; nativestr = str ; from builtins import *
>>> nativebytes(bytes(b'asdf'))
"b'asdf'" # Whoops!
>>> # This means you can't pass newbytes in many contexts, such as:
>>> from urllib import urlencode
>>> urlencode({ bytes(b'a'): 1, bytes(b'b'): 2 })
'b%27a%27=1&b%27b%27=2'
>>> # :o(

That behavior remains and is not addressed by either #173 or abf19bb. I'll leave it to you as to whether you want to close this issue and track the above via #193.

@depau

depau commented Mar 17, 2017

This is still an issue.

Running str(bytes(b"hello")) results in "b'hello'".

@posita
Contributor Author

posita commented Mar 17, 2017

Hmmm. I'm not sure this is broken. Or at least if it is, it might be broken semi-consistently with Python 3.x:

$ python -c 'import sys ; print(sys.version) ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; nativestr = str ; nativebytes = bytes ; from builtins import * ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; c = "str(nativebytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c))) ; c = "nativestr(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c))) ; c = "str(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c)))'
2.7.13 (default, Dec 18 2016, 17:56:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
type(bytes): <type 'type'>
type(str): <type 'type'>
str(b"asdf"): asdf
type(bytes): <class 'future.types.newbytes.BaseNewBytes'>
type(str): <class 'future.types.newstr.BaseNewStr'>
str(b"asdf"): asdf
str(nativebytes(b"asdf")): asdf
nativestr(bytes(b"asdf")): b'asdf'
str(bytes(b"asdf")): b'asdf'
$ python3.5 -c 'import sys ; print(sys.version) ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; c = "str(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c)))'
3.5.3 (default, Feb  1 2017, 17:52:10)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
type(bytes): <class 'type'>
type(str): <class 'type'>
str(b"asdf"): b'asdf'
str(bytes(b"asdf")): b'asdf'

So newstr(newbytes(b'asdf')) mirrors the Python 3 behavior, as does nativestr(newbytes(b'asdf')). newstr(nativebytes(b'asdf')) does not, however. EDIT: In fairness, I don't think it should. In Python 2, newstr(nativebytes(…)) is equivalent to newstr(nativestr(…)), which is probably not something that should end up as b'…'.

@Depaulicious, also note that the original issue was about passing a Unicode value to newbytes vs passing it to Python 3's native bytes. Yours is an inverted case.
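For anyone landing here from the str(bytes(...)) case, a short sketch of the Python 3 semantics that the backports are mirroring. It uses only built-in types; the variable names are illustrative.

```python
data = b"asdf"

# With no encoding argument, str() of a bytes object gives its repr-style
# text rather than decoding it.
as_repr = str(data)

# Passing an encoding, or calling .decode(), actually decodes the bytes.
as_text = str(data, "ascii")
also_text = data.decode("ascii")

print(as_repr)
print(as_text)
print(also_text)
```

So getting `b'asdf'` back from `str(...)` on a bytes object is the expected Python 3 result; decoding needs an explicit encoding.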

@posita posita changed the title FIXED IN #173 - newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>') newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>') Mar 17, 2017
cpascual pushed a commit to cpascual/taurus that referenced this issue Jan 2, 2019
The encoding to bytes in getModelMimeData implementations is still
problematic due to a bug in future.builtins.str
(PythonCharmers/python-future#171)
Avoid it by using the encode method of str instead in the
getModelMimeData implementations of:
- TaurusBaseComponent
- TaurusJDrawSynopticsView
- TaurusValue
@rectalogic
Contributor

I'm hitting this in Python 2 (passing an encoded string into a library that uses native Python 2 urllib):

import urllib

from future import standard_library
standard_library.install_aliases()
from builtins import *

d = {"k": str('a@b').encode("utf-8")}
urllib.urlencode(d, doseq=True)
Traceback (most recent call last):
  File "/tmp/f.py", line 10, in <module>
    urllib.urlencode(d, doseq=True)
  File "/usr/lib/python2.7/urllib.py", line 1348, in urlencode
    v = quote_plus(v)
  File "/usr/lib/python2.7/urllib.py", line 1305, in quote_plus
    return quote(s, safe)
  File "/usr/lib/python2.7/urllib.py", line 1298, in quote
    return ''.join(map(quoter, s))
KeyError: 97

Is there a workaround for this?

@rectalogic
Contributor

So urllib.quote wants to map over the string (https://github.com/python/cpython/blob/2.7/Lib/urllib.py#L1298), but iterating over a newbytes yields integers rather than the one-character strings that a Python 2 str yields. urllib.quote looks each item up in a map keyed by character, so none of the ordinals exist and it raises KeyError.

>>> from future import standard_library
>>> standard_library.install_aliases()
>>> from builtins import *
>>> [c for c in 'a@b']
['a', '@', 'b']
>>> [c for c in str('a@b').encode("utf-8")]
[97, 64, 98]
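The KeyError can be modelled with a character-keyed lookup table. This is a simplified sketch loosely modelled on how Python 2.7's urllib.quote works, not a copy of it: `SAFE_CHARS`, `quoter_table` and `quote_like_py2` are hypothetical names, and Python 3's native bytes stands in for newbytes because it also yields integers when iterated.

```python
import string

# Characters passed through unchanged (a simplified safe set).
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + "_.-")

# Lookup table keyed by one-character strings, never by integers.
quoter_table = {
    chr(i): (chr(i) if chr(i) in SAFE_CHARS else "%{:02X}".format(i))
    for i in range(256)
}


def quote_like_py2(s):
    """Quote `s` by looking up each item produced by iterating over it."""
    return "".join(quoter_table[item] for item in s)


print(quote_like_py2("a@b"))  # iterating text yields characters

try:
    quote_like_py2(b"a@b")    # iterating bytes yields integers
except KeyError as exc:
    print("KeyError:", exc)
```

The text input is quoted normally, while the bytes input fails on its first item because the integer 97 is not a key in the table.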

So if I encode a future str, the resulting newbytes is not useable with 3rd party python libraries that may use the unpatched standard library?
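One possible workaround, as a hedged sketch: convert backported objects to their native Python 2 types before handing them to unpatched library code. It assumes `future.utils.native` behaves as documented (it returns the native superclass type on Python 2 and is a no-op on Python 3). The wrapper name `urlencode_native` is hypothetical, and this has not been verified here against Python 2.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

try:
    from future.utils import native     # provided by the future package
except ImportError:
    def native(obj):
        # Fallback when future is not installed: nothing to convert.
        return obj


def urlencode_native(params):
    """Hypothetical wrapper: make keys and values native, then urlencode."""
    converted = {native(k): native(v) for k, v in params.items()}
    return urlencode(converted, doseq=True)


print(urlencode_native({"k": "a@b".encode("utf-8")}))
```

The design choice is to convert at the boundary with third-party code, so the rest of the program can keep using the backported types.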

arizvisa added a commit to arizvisa/syringe that referenced this issue Feb 4, 2023
…ytes" builtin with "builtins.bytes" during serialization.

This manifested itself due to the proxy-oriented providers explicitly using "builtin.bytes" to
convert a bytearray to bytes and only occurs in Py2 (when "builtins" actually does things).

It's probably related to PythonCharmers/python-future#171, but I didn't check too hard.