Fix: sort strings in UTF-8 encoded byte order with lazy encoding #8787

Merged
merged 16 commits into from
Mar 10, 2025
6 changes: 6 additions & 0 deletions .changeset/large-pants-hide.md
@@ -0,0 +1,6 @@
---
'@firebase/firestore': patch
'firebase': patch
---

Use lazy encoding in UTF-8 encoded byte comparison for strings.
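
To illustrate why the comparison order matters, here is a minimal sketch (not taken from this PR) of two strings that JavaScript's built-in `<` operator and UTF-8 byte order rank differently. It assumes a runtime with a global `TextEncoder`; the variable names are hypothetical and chosen only for illustration.

```typescript
// U+FE12 ('︒') fits in a single UTF-16 code unit: 0xFE12.
const bmp = '\uFE12';
// U+1F11F ('🄟') needs a surrogate pair; its first code unit is 0xD83C.
const supplementary = '\uD83C\uDD1F';

// The < operator compares UTF-16 code units, and 0xD83C < 0xFE12,
// so the supplementary-plane string sorts first.
const utf16PutsSupplementaryFirst = supplementary < bmp;

// UTF-8 bytes: U+FE12 is EF B8 92 and U+1F11F is F0 9F 84 9F.
// The first bytes already differ (0xEF < 0xF0), so U+FE12 sorts first.
const encoder = new TextEncoder();
const bmpBytes = encoder.encode(bmp);
const supplementaryBytes = encoder.encode(supplementary);
const utf8PutsBmpFirst = bmpBytes[0] < supplementaryBytes[0];
```

Both flags evaluate to `true`, so the two orderings disagree on this pair.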
@@ -655,5 +655,9 @@ export function dbKeyComparator(l: DocumentKey, r: DocumentKey): number {
return cmp;
}

// TODO(b/329441702): Document IDs should be sorted by UTF-8 encoded byte
// order, but IndexedDB sorts strings lexicographically. Document ID
// comparison here still relies on primitive comparison to avoid mismatches
// observed in snapshot listeners with Unicode characters in documentIds
return primitiveComparator(left[left.length - 1], right[right.length - 1]);
}
11 changes: 3 additions & 8 deletions packages/firestore/src/model/path.ts
@@ -19,6 +19,7 @@ import { Integer } from '@firebase/webchannel-wrapper/bloom-blob';

import { debugAssert, fail } from '../util/assert';
import { Code, FirestoreError } from '../util/error';
+ import { compareUtf8Strings, primitiveComparator } from '../util/misc';

export const DOCUMENT_KEY_NAME = '__name__';

@@ -181,7 +182,7 @@ abstract class BasePath<B extends BasePath<B>> {
return comparison;
}
}
- return Math.sign(p1.length - p2.length);
+ return primitiveComparator(p1.length, p2.length);
}

private static compareSegments(lhs: string, rhs: string): number {
@@ -201,13 +202,7 @@ abstract class BasePath<B extends BasePath<B>> {
);
} else {
// both non-numeric
- if (lhs < rhs) {
- return -1;
- }
- if (lhs > rhs) {
- return 1;
- }
- return 0;
+ return compareUtf8Strings(lhs, rhs);
}
}

10 changes: 7 additions & 3 deletions packages/firestore/src/model/values.ts
@@ -25,7 +25,11 @@ import {
Value
} from '../protos/firestore_proto_api';
import { fail } from '../util/assert';
- import { arrayEquals, primitiveComparator } from '../util/misc';
+ import {
+ arrayEquals,
+ compareUtf8Strings,
+ primitiveComparator
+ } from '../util/misc';
import { forEach, objectSize } from '../util/obj';
import { isNegativeZero } from '../util/types';

@@ -251,7 +255,7 @@ export function valueCompare(left: Value, right: Value): number {
getLocalWriteTime(right)
);
case TypeOrder.StringValue:
- return primitiveComparator(left.stringValue!, right.stringValue!);
+ return compareUtf8Strings(left.stringValue!, right.stringValue!);
case TypeOrder.BlobValue:
return compareBlobs(left.bytesValue!, right.bytesValue!);
case TypeOrder.RefValue:
@@ -400,7 +404,7 @@ function compareMaps(left: MapValue, right: MapValue): number {
rightKeys.sort();

for (let i = 0; i < leftKeys.length && i < rightKeys.length; ++i) {
- const keyCompare = primitiveComparator(leftKeys[i], rightKeys[i]);
+ const keyCompare = compareUtf8Strings(leftKeys[i], rightKeys[i]);
if (keyCompare !== 0) {
return keyCompare;
}
57 changes: 57 additions & 0 deletions packages/firestore/src/util/misc.ts
@@ -16,6 +16,7 @@
*/

import { randomBytes } from '../platform/random_bytes';
import { newTextEncoder } from '../platform/text_serializer';

import { debugAssert } from './assert';

@@ -74,6 +75,62 @@ export interface Equatable<T> {
isEqual(other: T): boolean;
}

/** Compare strings in UTF-8 encoded byte order */
export function compareUtf8Strings(left: string, right: string): number {
let i = 0;
while (i < left.length && i < right.length) {
const leftCodePoint = left.codePointAt(i)!;
const rightCodePoint = right.codePointAt(i)!;

if (leftCodePoint !== rightCodePoint) {
if (leftCodePoint < 128 && rightCodePoint < 128) {
// ASCII comparison
return primitiveComparator(leftCodePoint, rightCodePoint);
} else {
// Lazy instantiate TextEncoder
const encoder = newTextEncoder();

// UTF-8 encode the character at index i for byte comparison.
const leftBytes = encoder.encode(getUtf8SafeSubstring(left, i));
const rightBytes = encoder.encode(getUtf8SafeSubstring(right, i));
for (
let j = 0;
j < Math.min(leftBytes.length, rightBytes.length);
j++
) {
const comp = primitiveComparator(leftBytes[j], rightBytes[j]);
if (comp !== 0) {
return comp;
}
}
Review comment (Contributor):

IIUC, this loop exists because we can't use Buffer.from(...) in the browser (outside of Node). If leftBytes.length is not equal to rightBytes.length and the first Math.min(...) bytes match, this loop won't return a comparison, so the code falls through to the primitiveComparator (line 112) far more often than it should.

  • Can we use the compareBlobs function?
  • If not, can you move this loop into a helper function that compares two Uint8Arrays byte by byte and also considers their lengths?

Reply (Contributor Author):

We cannot import compareBlobs, as it would create a circular dependency due to ByteString.
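
The helper requested in the thread above could look roughly like the sketch below. `compareByteArrays` is a hypothetical name and is not part of this PR; the sketch returns -1, 0 or 1 directly instead of calling `primitiveComparator`, so that it stays self-contained.

```typescript
/**
 * Compares two byte arrays lexicographically. When one array is a strict
 * prefix of the other, the shorter array sorts first.
 */
function compareByteArrays(left: Uint8Array, right: Uint8Array): number {
  const shared = Math.min(left.length, right.length);
  for (let i = 0; i < shared; i++) {
    if (left[i] !== right[i]) {
      return left[i] < right[i] ? -1 : 1;
    }
  }
  // All shared bytes match, so the lengths decide the result.
  if (left.length === right.length) {
    return 0;
  }
  return left.length < right.length ? -1 : 1;
}
```

Unlike the inline loop, this always returns a result, including when the two arrays differ only in length.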

// EXTREMELY RARE CASE: Code points differ, but their UTF-8 byte
// representations are identical. This can happen with malformed input
// (invalid surrogate pairs). The backend also actively prevents invalid
// surrogates as INVALID_ARGUMENT errors, so we almost never receive
// invalid strings from backend.
// Fallback to code point comparison for graceful handling.
return primitiveComparator(leftCodePoint, rightCodePoint);
}
}
// Increment by 2 for surrogate pairs, 1 otherwise
i += leftCodePoint > 0xffff ? 2 : 1;
}

// Compare lengths if all characters are equal
return primitiveComparator(left.length, right.length);
}

function getUtf8SafeSubstring(str: string, index: number): string {
const firstCodePoint = str.codePointAt(index)!;
if (firstCodePoint > 0xffff) {
// It's a surrogate pair, return the whole pair
return str.substring(index, index + 2);
} else {
// It's a single code point, return it
return str.substring(index, index + 1);
}
}

export interface Iterable<V> {
forEach: (cb: (v: V) => void) => void;
}
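
As a cross-check on the ordering `compareUtf8Strings` aims for, the sketch below encodes both strings eagerly and compares the resulting bytes. `compareUtf8Eager` is a hypothetical reference comparator, not code from this PR; it assumes a global `TextEncoder` and skips the lazy, per-code-point encoding that the PR uses to avoid encoding whole strings.

```typescript
/** Eager reference comparator: encode both strings fully, then compare bytes. */
function compareUtf8Eager(left: string, right: string): number {
  const encoder = new TextEncoder();
  const l = encoder.encode(left);
  const r = encoder.encode(right);
  const shared = Math.min(l.length, r.length);
  for (let i = 0; i < shared; i++) {
    if (l[i] !== r[i]) {
      return l[i] < r[i] ? -1 : 1;
    }
  }
  // One encoding is a prefix of the other (or they are equal).
  if (l.length === r.length) {
    return 0;
  }
  return l.length < r.length ? -1 : 1;
}

// The string values used by the integration tests in this PR.
const values = [
  'Łukasiewicz',
  'Sierpiński',
  '岩澤',
  '🄟',
  'P',
  '︒',
  '🐵',
  '你好',
  '你顥',
  '😁',
  '😀'
];

// Copy before sorting so the input array is left untouched.
const sortedByUtf8 = values.slice().sort(compareUtf8Eager);
```

`sortedByUtf8` matches the document-key order the tests expect from the server. A plain `values.slice().sort()` would instead place '🄟', '🐵', '😀' and '😁' ahead of '︒' and 'P', because their leading surrogates (0xD83C, 0xD83D) are smaller than 0xFE12 and 0xFF30.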
239 changes: 239 additions & 0 deletions packages/firestore/test/integration/api/database.test.ts
@@ -2424,4 +2424,243 @@ apiDescribe('Database', persistence => {
});
});
});

describe('Sort unicode strings', () => {
const expectedDocs = [
'b',
'a',
'h',
'i',
'c',
'f',
'e',
'd',
'g',
'k',
'j'
];
it('snapshot listener sorts unicode strings the same as server', async () => {
const testDocs = {
'a': { value: 'Łukasiewicz' },
'b': { value: 'Sierpiński' },
'c': { value: '岩澤' },
'd': { value: '🄟' },
'e': { value: 'P' },
'f': { value: '︒' },
'g': { value: '🐵' },
'h': { value: '你好' },
'i': { value: '你顥' },
'j': { value: '😁' },
'k': { value: '😀' }
};

return withTestCollection(persistence, testDocs, async collectionRef => {
const orderedQuery = query(collectionRef, orderBy('value'));

const getSnapshot = await getDocsFromServer(orderedQuery);
expect(toIds(getSnapshot)).to.deep.equal(expectedDocs);

const storeEvent = new EventsAccumulator<QuerySnapshot>();
const unsubscribe = onSnapshot(orderedQuery, storeEvent.storeEvent);
const watchSnapshot = await storeEvent.awaitEvent();
expect(toIds(watchSnapshot)).to.deep.equal(toIds(getSnapshot));

unsubscribe();

await checkOnlineAndOfflineResultsMatch(orderedQuery, ...expectedDocs);
});
});

it('snapshot listener sorts unicode strings in array the same as server', async () => {
const testDocs = {
'a': { value: ['Łukasiewicz'] },
'b': { value: ['Sierpiński'] },
'c': { value: ['岩澤'] },
'd': { value: ['🄟'] },
'e': { value: ['P'] },
'f': { value: ['︒'] },
'g': { value: ['🐵'] },
'h': { value: ['你好'] },
'i': { value: ['你顥'] },
'j': { value: ['😁'] },
'k': { value: ['😀'] }
};

return withTestCollection(persistence, testDocs, async collectionRef => {
const orderedQuery = query(collectionRef, orderBy('value'));

const getSnapshot = await getDocsFromServer(orderedQuery);
expect(toIds(getSnapshot)).to.deep.equal(expectedDocs);

const storeEvent = new EventsAccumulator<QuerySnapshot>();
const unsubscribe = onSnapshot(orderedQuery, storeEvent.storeEvent);
const watchSnapshot = await storeEvent.awaitEvent();
expect(toIds(watchSnapshot)).to.deep.equal(toIds(getSnapshot));

unsubscribe();

await checkOnlineAndOfflineResultsMatch(orderedQuery, ...expectedDocs);
});
});

it('snapshot listener sorts unicode strings in map the same as server', async () => {
const testDocs = {
'a': { value: { foo: 'Łukasiewicz' } },
'b': { value: { foo: 'Sierpiński' } },
'c': { value: { foo: '岩澤' } },
'd': { value: { foo: '🄟' } },
'e': { value: { foo: 'P' } },
'f': { value: { foo: '︒' } },
'g': { value: { foo: '🐵' } },
'h': { value: { foo: '你好' } },
'i': { value: { foo: '你顥' } },
'j': { value: { foo: '😁' } },
'k': { value: { foo: '😀' } }
};

return withTestCollection(persistence, testDocs, async collectionRef => {
const orderedQuery = query(collectionRef, orderBy('value'));

const getSnapshot = await getDocsFromServer(orderedQuery);
expect(toIds(getSnapshot)).to.deep.equal(expectedDocs);

const storeEvent = new EventsAccumulator<QuerySnapshot>();
const unsubscribe = onSnapshot(orderedQuery, storeEvent.storeEvent);
const watchSnapshot = await storeEvent.awaitEvent();
expect(toIds(watchSnapshot)).to.deep.equal(toIds(getSnapshot));

unsubscribe();

await checkOnlineAndOfflineResultsMatch(orderedQuery, ...expectedDocs);
});
});

it('snapshot listener sorts unicode strings in map key the same as server', async () => {
const testDocs = {
'a': { value: { 'Łukasiewicz': true } },
'b': { value: { 'Sierpiński': true } },
'c': { value: { '岩澤': true } },
'd': { value: { '🄟': true } },
'e': { value: { 'P': true } },
'f': { value: { '︒': true } },
'g': { value: { '🐵': true } },
'h': { value: { '你好': true } },
'i': { value: { '你顥': true } },
'j': { value: { '😁': true } },
'k': { value: { '😀': true } }
};

return withTestCollection(persistence, testDocs, async collectionRef => {
const orderedQuery = query(collectionRef, orderBy('value'));

const getSnapshot = await getDocsFromServer(orderedQuery);
expect(toIds(getSnapshot)).to.deep.equal(expectedDocs);

const storeEvent = new EventsAccumulator<QuerySnapshot>();
const unsubscribe = onSnapshot(orderedQuery, storeEvent.storeEvent);
const watchSnapshot = await storeEvent.awaitEvent();
expect(toIds(watchSnapshot)).to.deep.equal(toIds(getSnapshot));

unsubscribe();

await checkOnlineAndOfflineResultsMatch(orderedQuery, ...expectedDocs);
});
});

it('snapshot listener sorts unicode strings in document key the same as server', async () => {
const testDocs = {
'Łukasiewicz': { value: true },
'Sierpiński': { value: true },
'岩澤': { value: true },
'🄟': { value: true },
'P': { value: true },
'︒': { value: true },
'🐵': { value: true },
'你好': { value: true },
'你顥': { value: true },
'😁': { value: true },
'😀': { value: true }
};

return withTestCollection(persistence, testDocs, async collectionRef => {
const orderedQuery = query(collectionRef, orderBy(documentId()));

const getSnapshot = await getDocsFromServer(orderedQuery);
const expectedDocs = [
'Sierpiński',
'Łukasiewicz',
'你好',
'你顥',
'岩澤',
'︒',
'P',
'🄟',
'🐵',
'😀',
'😁'
];
expect(toIds(getSnapshot)).to.deep.equal(expectedDocs);

const storeEvent = new EventsAccumulator<QuerySnapshot>();
const unsubscribe = onSnapshot(orderedQuery, storeEvent.storeEvent);
const watchSnapshot = await storeEvent.awaitEvent();
expect(toIds(watchSnapshot)).to.deep.equal(toIds(getSnapshot));

unsubscribe();

await checkOnlineAndOfflineResultsMatch(orderedQuery, ...expectedDocs);
});
});

// eslint-disable-next-line no-restricted-properties
(persistence.storage === 'indexeddb' ? it.skip : it)(
'snapshot listener sorts unicode strings in document key the same as server with persistence',
async () => {
const testDocs = {
'Łukasiewicz': { value: true },
'Sierpiński': { value: true },
'岩澤': { value: true },
'🄟': { value: true },
'P': { value: true },
'︒': { value: true },
'🐵': { value: true },
'你好': { value: true },
'你顥': { value: true },
'😁': { value: true },
'😀': { value: true }
};

return withTestCollection(
persistence,
testDocs,
async collectionRef => {
const orderedQuery = query(collectionRef, orderBy('value'));

const getSnapshot = await getDocsFromServer(orderedQuery);
expect(toIds(getSnapshot)).to.deep.equal([
'Sierpiński',
'Łukasiewicz',
'你好',
'你顥',
'岩澤',
'︒',
'P',
'🄟',
'🐵',
'😀',
'😁'
]);

const storeEvent = new EventsAccumulator<QuerySnapshot>();
const unsubscribe = onSnapshot(orderedQuery, storeEvent.storeEvent);
const watchSnapshot = await storeEvent.awaitEvent();
// TODO: IndexedDB sorts strings lexicographically and misses the documents with IDs '🄟' and '🐵'.
expect(toIds(watchSnapshot)).to.deep.equal(toIds(getSnapshot));

unsubscribe();
}
);
}
);
});
});