Almost all non-utf8 encodings are incorrect #680

@ChALkeR

Even the most basic one, windows-1252, which latin1 and ascii alias to, is decoded incorrectly:

const express = require('express')
const bodyParser = require('body-parser')

const app = express()
app.use(bodyParser.urlencoded())
app.use(bodyParser.json())
app.use(bodyParser.text())

app.use(function (req, res) {
  res.setHeader('Content-Type', 'text/plain')
  res.write('you posted:\n')
  res.write(`${escape(req.body)}\n`)
  res.end(String(req.body))
})

app.listen(8080, async () => {
  const res = await fetch('http://localhost:8080/', {
    method: 'POST',
    headers: { 'content-type': 'text/plain; charset=windows-1252' },
    body: Uint8Array.of(0x80, 0x81, 0x82, 0x83, 0x8d, 0x9e, 0x9f)
  })

  console.log(await res.text())
})

Results in:

you posted:
%u20AC%uFFFD%u201A%u0192%uFFFD%u017E%u0178
€�‚ƒ�žŸ

But it should be €\x81‚ƒ\x8DžŸ instead, with no replacement characters,
i.e. %u20AC%81%u201A%u0192%8D%u017E%u0178 when escaped.

See Encoding Standard: https://encoding.spec.whatwg.org/
All characters are mapped in https://encoding.spec.whatwg.org/index-windows-1252.txt, including 0x81 and 0x8D.
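
For reference, here is a minimal sketch of what the index prescribes for this input. It is not body-parser's code or any library's API: `WINDOWS_1252_HIGH` and `decodeWindows1252` are hypothetical names, and the table is transcribed by hand from index-windows-1252.txt for bytes 0x80..0x9F (all other bytes map to the code point with the same value).

```javascript
// Code points for bytes 0x80..0x9F, transcribed from index-windows-1252.txt.
// Note that 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the C1 controls of the
// same value, not to U+FFFD.
const WINDOWS_1252_HIGH = [
  0x20ac, 0x0081, 0x201a, 0x0192, 0x201e, 0x2026, 0x2020, 0x2021,
  0x02c6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008d, 0x017d, 0x008f,
  0x0090, 0x2018, 0x2019, 0x201c, 0x201d, 0x2022, 0x2013, 0x2014,
  0x02dc, 0x2122, 0x0161, 0x203a, 0x0153, 0x009d, 0x017e, 0x0178
]

// Hypothetical helper: decodes a Uint8Array as windows-1252 per the index.
function decodeWindows1252 (bytes) {
  let out = ''
  for (const byte of bytes) {
    const inRemappedRange = byte >= 0x80 && byte <= 0x9f
    // Outside 0x80..0x9F the byte value is the code point itself.
    out += String.fromCharCode(inRemappedRange ? WINDOWS_1252_HIGH[byte - 0x80] : byte)
  }
  return out
}

const decoded = decodeWindows1252(Uint8Array.of(0x80, 0x81, 0x82, 0x83, 0x8d, 0x9e, 0x9f))
console.log(escape(decoded)) // %u20AC%81%u201A%u0192%8D%u017E%u0178
```

Every byte has a mapping, so a decoder following the index never needs to emit a replacement character for windows-1252.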

The same goes for other encodings. Half of the single-byte encodings are mapped incorrectly and contradict the spec: all of the windows-* family except windows-1256, koi8-u, macintosh.


All of the supported legacy multi-byte encodings also behave incorrectly.


UTF-16 also behaves incorrectly:

const express = require('express')
const bodyParser = require('body-parser')

const app = express()
app.use(bodyParser.urlencoded())
app.use(bodyParser.json())
app.use(bodyParser.text())

app.use(function (req, res) {
  res.setHeader('Content-Type', 'text/plain')
  res.write('you posted:\n')
  res.write(`Is well formed: ${req.body.isWellFormed()}\n`)
  res.write(`${escape(req.body)}\n`)
  res.end(String(req.body))
})

app.listen(8080, async () => {
  const res = await fetch('http://localhost:8080/', {
    method: 'POST',
    headers: { 'content-type': 'text/plain; charset=utf-16le' },
    body: Uint8Array.of(0, 0xd8, 0, 0xd8)
  })

  console.log(await res.text())
})

Results in:

you posted:
Is well formed: false
%uD800%uD800
��

But per spec the decoder should never produce non-well-formed strings; it should have produced replacement characters instead, i.e. %uFFFD%uFFFD when escaped.

See spec: https://encoding.spec.whatwg.org/#shared-utf-16-decoder
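
To make the expected behavior concrete, here is a rough sketch of a UTF-16LE decoder that follows the surrogate handling described in that section of the spec. It is an illustration only, not body-parser's implementation: `decodeUtf16le` is a hypothetical name, and BOM handling and streaming are left out.

```javascript
// Hypothetical helper: decodes a Uint8Array as UTF-16LE, replacing every
// unpaired surrogate (and a dangling trailing byte) with U+FFFD.
function decodeUtf16le (bytes) {
  let out = ''
  let leadSurrogate = null
  const evenLength = bytes.length - (bytes.length % 2)

  for (let i = 0; i < evenLength; i += 2) {
    const unit = bytes[i] | (bytes[i + 1] << 8) // little-endian code unit

    if (leadSurrogate !== null) {
      const lead = leadSurrogate
      leadSurrogate = null
      if (unit >= 0xdc00 && unit <= 0xdfff) {
        out += String.fromCharCode(lead, unit) // valid surrogate pair
        continue
      }
      out += '\ufffd' // unpaired lead; the current unit is processed below
    }

    if (unit >= 0xd800 && unit <= 0xdbff) {
      leadSurrogate = unit // wait for a trail surrogate
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      out += '\ufffd' // trail surrogate without a lead
    } else {
      out += String.fromCharCode(unit)
    }
  }

  // End of input with a pending lead surrogate or an odd byte: one error.
  if (leadSurrogate !== null || evenLength !== bytes.length) out += '\ufffd'
  return out
}

console.log(escape(decodeUtf16le(Uint8Array.of(0, 0xd8, 0, 0xd8)))) // %uFFFD%uFFFD
```

With this handling the output is always a well-formed string, regardless of what bytes the client sends.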

This could have potential security impact

These decoders are enabled in the default configuration.
The default utf-8 decoder never produces non-well-formed strings, but a client can force them by specifying the utf-16 encoding, which per spec shouldn't be possible (produced strings should always be well-formed).
