An introduction to font fallback. Primarily aimed at people working on Google Fonts. The reader is assumed to have basic familiarity with Font 101.
Imagine we offer email, chat, or any other service where user-entered text can be displayed. Users will probably expect more than basic Latin to work. We should strive to display any valid sequence of Unicode codepoints correctly.
We typically get a clue to the language from at least one of the app or web developer, the browser, or the operating system.
So, given (language, codepoint sequence), we need to be able to produce something sufficient to render the text.
We take that to mean breaking the input text into one or more runs of text that are to be drawn with a specific font.
The goal of this document is to explain enough of the rudiments of this problem to build a toy solution.
Raph's talk on Android Typography (YouTube) is well worth a watch for context.
It is implausible for a single font, limited to 65,535 glyphs, to support all the world's languages. We’re going to need a bunch of fonts. If we have multiple fonts we'll also need rules for how to choose which one should be used to render a given unit of text. Let's call (fonts, rules) a font configuration.
We can now specify our problem a bit more concretely:
Input: (font configuration, language, codepoint sequence)
Output: sequence of (font, codepoint sequence)
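One way to pin down that shape is a minimal Python sketch. The names `FontName`, `Run`, and `covers_input`, and the toy font file names, are made up for illustration. The only property checked is that the runs, taken in order, reproduce the input exactly.

```python
from typing import List, Tuple

# Hypothetical stand-ins: a font is identified by its file name only.
FontName = str
Run = Tuple[FontName, str]  # (font, codepoint sequence)


def covers_input(text: str, runs: List[Run]) -> bool:
    """True if the runs, in order, spell out exactly the input text."""
    return "".join(run_text for _font, run_text in runs) == text


# "abc " followed by an Arabic word, split by hand into two runs.
text = "abc \u0645\u0631\u062d\u0628\u0627"
runs = [
    ("ToyLatin-Regular.ttf", "abc "),
    ("ToyArabic-Regular.ttf", "\u0645\u0631\u062d\u0628\u0627"),
]
print(covers_input(text, runs))
```

Whatever else a fallback implementation does, its output should pass a check like this: nothing dropped, nothing reordered.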
Building an entirely new set of fonts for most or all of Unicode is a big job. Thankfully Android is open source and has both a configuration (fonts.xml) and a set of open source fonts.
We could just use Android's entire text stack but that wouldn’t leave us anything to play with!
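To get a feel for working with such a configuration, here is a sketch that parses a small hand-written XML snippet loosely modeled on fonts.xml. The element and attribute names (`familyset`, `family`, `lang`, `font`) are assumptions about the format, and the font file names are made up. Treat the real file and its comments as the source of truth.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT copied from Android. Names are assumptions.
SAMPLE = """
<familyset>
  <family name="sans-serif">
    <font weight="400" style="normal">ToySans-Regular.ttf</font>
  </family>
  <family lang="und-Arab">
    <font weight="400" style="normal">ToyArabic-Regular.ttf</font>
  </family>
  <family lang="ja">
    <font weight="400" style="normal">ToyCJK-JP.ttf</font>
  </family>
</familyset>
"""


def load_families(xml_text):
    """Return families in document order; order matters for fallback."""
    root = ET.fromstring(xml_text.strip())
    families = []
    for family in root.iter("family"):
        files = [(font.text or "").strip() for font in family.iter("font")]
        families.append(
            {"name": family.get("name"), "lang": family.get("lang"), "files": files}
        )
    return families


families = load_families(SAMPLE)
for family in families:
    print(family)
```

Keeping the families in a list rather than a dict preserves document order, which is what a "first match wins" rule needs.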
It is tempting to think of codepoint as meaning a user perceived character. Unfortunately this isn't at all true:
- A medium skin tone woman with red hair is a single user perceived character, 4 codepoints
  - Emojipedia woman medium skin tone red hair
- Ìṣọ̀lá is 5 user perceived characters, 6 codepoints
  - ọ̀ is two codepoints: Latin small o with dot below, combining grave accent
  - Combining marks
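The codepoint counts above are easy to check with Python's standard `unicodedata` module. This is a small sketch; the strings are spelled with escapes so the exact codepoints are visible, and the helper name `describe` is made up.

```python
import unicodedata

# WOMAN + medium skin tone modifier + ZERO WIDTH JOINER + red hair component
woman = "\U0001F469\U0001F3FD\u200D\U0001F9B0"
# I-grave, s-dot-below, o-dot-below, combining grave, l, a-acute
name = "\u00CC\u1E63\u1ECD\u0300l\u00E1"


def describe(text):
    """List (U+XXXX, Unicode name) for every codepoint in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text
    ]


for label, text in (("woman", woman), ("name", name)):
    print(label, len(text), "codepoints")
    for codepoint, uname in describe(text):
        print("  ", codepoint, uname)
```

In Python, `len()` on a `str` counts codepoints, not user perceived characters, which is exactly the mismatch described here.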
Our results will be much better if we ensure that each user perceived character comes entirely from one font. "User perceived character" is clumsy, so let’s use "grapheme" (“The smallest meaningful contrastive unit in a writing system.”, Oxford) or "grapheme cluster."
That means we want to iterate over the grapheme clusters and pick the best font for each cluster. It's easy to loop over the codepoints in a string. Looping over graphemes is harder. Thankfully Unicode has a detailed description of how to approach this in Unicode Standard Annex #29, "Unicode Text Segmentation" (tr29). Even better, International Components for Unicode (ICU, http://icu-project.org/) provides an implementation.
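To make the idea concrete without installing anything, here is a deliberately simplified clustering sketch using only the standard library. It is NOT an implementation of tr29: it only glues combining marks, emoji skin tone modifiers, and zero width joiner sequences onto the preceding codepoint, and it ignores Hangul, regional indicators, CR LF, and much more. The function names are made up. For real work use ICU or a similar library.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def _extends_previous(ch):
    """Rough guess at 'this codepoint belongs to the cluster before it'."""
    if ch == ZWJ:
        return True
    if 0x1F3FB <= ord(ch) <= 0x1F3FF:  # emoji skin tone modifiers
        return True
    # Mn/Mc/Me are the Unicode "Mark" general categories.
    return unicodedata.category(ch) in ("Mn", "Mc", "Me")


def toy_clusters(text):
    """Split text into approximate grapheme clusters (toy rules only)."""
    clusters = []
    for ch in text:
        # Simplification: anything right after a ZWJ is joined, even where
        # tr29 would require an emoji on both sides.
        if clusters and (_extends_previous(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


woman = "\U0001F469\U0001F3FD\u200D\U0001F9B0"
name = "\u00CC\u1E63\u1ECD\u0300l\u00E1"
print(len(toy_clusters(woman)), "cluster(s) in the emoji")
print(len(toy_clusters(name)), "cluster(s) in the name")
```

Even this crude version keeps the emoji sequence and the o with its two marks together, which is the property font selection cares about.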
We now have enough to start thinking about implementing a fallback system. Pick a programming language and write a toy implementation! A few tips and reminders:
- Android has a set of fonts and a configuration defining how to prioritize them
  - I have gathered examples by Android API level here
  - Read the comment on fonts.xml carefully; the prioritization of fonts (first match by lang, then by order) is critical.
- ICU provides BreakIterator, making it easy to loop over grapheme clusters in our input text
  - unicode_segmentation provides grapheme, word, and sentence breaking for Rust
  - ICU4J for Java and PyICU are available as well

        # PyICU can be grumpy about installation; this worked for me on Mac using Homebrew in a py3 venv
        brew install icu4c
        export PATH="/usr/local/opt/icu4c/bin:$PATH"
        pip install pyicu

- Skrifa can give you the codepoints a font supports through its charmap
- Merge adjacent clusters using the same font to form runs
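Putting the tips together, a toy version might look like the sketch below. Everything in it is an assumption made for illustration: the `Font` class, the font names, the coverage sets (standing in for what a charmap would report), exact string comparison as the language match, and the rule that a font must cover every codepoint in a cluster. Clusters are passed in already split, so any grapheme segmenter can be used in front of it.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Font:
    name: str
    coverage: FrozenSet[int]    # codepoints the font can draw
    lang: Optional[str] = None  # language this font is preferred for


def pick_font(fonts, lang, cluster):
    """First match by lang, then by order; None if nothing covers it."""
    needed = {ord(ch) for ch in cluster}
    preferred = [f for f in fonts if f.lang is not None and f.lang == lang]
    others = [f for f in fonts if f.lang is None or f.lang != lang]
    for font in preferred + others:
        if needed <= font.coverage:
            return font
    return None


def make_runs(fonts, lang, clusters) -> List[Tuple[Optional[Font], str]]:
    """Assign a font per cluster, merging neighbours that share a font."""
    runs = []
    for cluster in clusters:
        font = pick_font(fonts, lang, cluster)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + cluster)
        else:
            runs.append((font, cluster))
    return runs


latin = Font("ToyLatin", frozenset(range(0x20, 0x7F)))
cjk_sc = Font("ToyCJK-SC", frozenset({0x76F4, 0x89D2}), lang="zh-Hans")
cjk_jp = Font("ToyCJK-JP", frozenset({0x76F4, 0x89D2}), lang="ja")
fonts = [latin, cjk_sc, cjk_jp]

clusters = list("ab\u76f4\u89d2c")  # every cluster here is one codepoint
for lang in ("ja", "en"):
    print(lang, [(f.name if f else None, t) for f, t in make_runs(fonts, lang, clusters)])
```

The two CJK fonts cover the same codepoints, so the language hint decides between them; without a matching hint, configuration order wins. A cluster no font covers comes back with `None`, leaving the caller to decide how to draw it.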
That should be enough to implement what we wanted at the beginning:
Input: (font configuration, language, codepoint sequence)
Output: sequence of (font, codepoint sequence)
Note that this is over-simplified but perhaps enough to give some feel for the problem.