Notes on the original formats

Working out how to convert the original files into something more usable was harder than I anticipated, so I'm using this space to document the formats for anyone else who needs to work with these files in the future.

Encoding

In general, ASCII characters can be treated as-is wherever they appear: in comments, colon delimiters, and whitespace. This holds for all of the JIS encodings used here, because their Japanese characters occupy byte ranges that don't conflict with ASCII. However, simply applying one encoding conversion to an entire file did not work. None of the Japanese codecs offered in Rust's encoding crate could translate a whole file to UTF-8 without errors. The characters being decomposed come from different JIS character sets depending on the file, and the radicals may be in a different encoding from the kanji, which makes the files a bit annoying to deal with.

Below, I've broken out which encodings worked for me in each part of each file. Please also read the comments in the files themselves for additional grammar details. To inspect the non-ASCII bytes easily, I recommend my ascii_hexdump project.
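Since the ASCII parts and the Japanese parts have to be handled differently, one possible first step is to split each raw line into ASCII and non-ASCII runs before decoding anything. The following is a minimal sketch using only the standard library; `Run` and `split_runs` are hypothetical names for illustration, not part of this library.

```rust
/// A run of consecutive bytes that are either all ASCII or all non-ASCII.
#[derive(Debug, PartialEq)]
pub enum Run<'a> {
    /// Bytes below 0x80, safe to read as text directly.
    Ascii(&'a str),
    /// Bytes at or above 0x80, still in their original JIS encoding.
    Encoded(&'a [u8]),
}

/// Splits a raw line into alternating ASCII and non-ASCII runs.
pub fn split_runs(line: &[u8]) -> Vec<Run<'_>> {
    let mut runs = Vec::new();
    let mut start = 0;
    while start < line.len() {
        let is_ascii = line[start].is_ascii();
        // Length of the run that shares the first byte's ASCII-ness.
        let len = line[start..]
            .iter()
            .take_while(|b| b.is_ascii() == is_ascii)
            .count();
        let chunk = &line[start..start + len];
        runs.push(if is_ascii {
            // Every byte is below 0x80, so the chunk is valid UTF-8.
            Run::Ascii(std::str::from_utf8(chunk).expect("ASCII is valid UTF-8"))
        } else {
            Run::Encoded(chunk)
        });
        start += len;
    }
    runs
}
```

Each `Run::Encoded` chunk can then be handed to whichever codec or conversion table suits that part of the file.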

JIS X 0208

These characters are always two bytes long. I personally used a JIS X 0213 to Unicode conversion table because it seems to be compatible and it was easier to find a reference for it.
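Conversion tables are often indexed by row and cell (kuten) rather than by raw bytes. As a sketch, and assuming the two bytes are stored in EUC-JP form (each byte in the range 0xA1 to 0xFE), the row and cell can be recovered by subtracting 0xA0 from each byte. The function name `euc_to_kuten` is hypothetical, and whether your table is keyed this way is something to check against the table itself.

```rust
/// Converts a two-byte EUC-JP sequence into a JIS X 0208 (row, cell) pair.
/// Returns `None` if either byte is outside the EUC-JP two-byte range.
/// Assumption: the input uses EUC-JP framing, where each byte is the
/// 7-bit JIS byte with the high bit set.
pub fn euc_to_kuten(bytes: [u8; 2]) -> Option<(u8, u8)> {
    let in_range = |b: u8| (0xA1..=0xFE).contains(&b);
    if in_range(bytes[0]) && in_range(bytes[1]) {
        // Rows and cells are numbered from 1, so 0xA1 maps to 1.
        Some((bytes[0] - 0xA0, bytes[1] - 0xA0))
    } else {
        None
    }
}
```

For example, the bytes `B0 A1` map to row 16, cell 1.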

Radicals

Unfortunately, since the JIS X 0208 character set is limited, the authors in some cases had to use characters that contained the radical they wanted instead of the radical itself. They recommend a number of Unicode characters that better represent the radicals they wanted. However, some of the suggestions appear to be incorrect and others that are used by the WWWJDIC server are missing. Below are the replacements used by this library. Thanks to Jisho for some excellent alternate characters they found.

  • 化 -> ⺅
  • 个 -> 𠆢
  • 并 -> 丷
  • 刈 -> ⺉
  • 込 -> ⻌
  • 尚 -> ⺌
  • 忙 -> ⺖
  • 扎 -> 扌
  • 汁 -> ⺡
  • 犯 -> ⺨
  • 艾 -> ⺾
  • 邦 -> ⻏
  • 阡 -> ⻙
  • 阡 -> ⻖
  • 老 -> ⺹
  • 杰 -> ⺣
  • 礼 -> ⺭
  • 疔 -> ⽧
  • 禹 -> ⽱
  • 初 -> ⻂
  • 買 -> ⺲
  • 滴 -> 啇
  • 乞 -> 𠂉
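A replacement list like this can be applied with a small lookup table after decoding. The sketch below includes only an excerpt of the entries above, and the names `REPLACEMENTS` and `replace_radical` are hypothetical; it is not necessarily how this library stores the mapping.

```rust
/// Excerpt of the stand-in -> radical replacements listed above.
const REPLACEMENTS: &[(char, char)] = &[
    ('化', '⺅'),
    ('个', '𠆢'),
    ('并', '丷'),
    ('刈', '⺉'),
    ('込', '⻌'),
    ('乞', '𠂉'),
];

/// Returns the preferred radical for a stand-in character,
/// or the character itself when no replacement is listed.
pub fn replace_radical(c: char) -> char {
    REPLACEMENTS
        .iter()
        .find(|(from, _)| *from == c)
        .map(|(_, to)| *to)
        .unwrap_or(c)
}
```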

JIS X 0212

Where they stand alone, these characters are three bytes long. The JIS X 0212 to Unicode conversion table I found does not seem to work on these. Instead, the EUC-JP codec works. JIS X 0212 is also used for the kanji in the radkfile, but not every character there is three bytes long, and the characters aren't whitespace-delimited. Just take those entire lines and run them through the EUC-JP codec rather than trying to split out individual characters by hand.
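To illustrate why the character lengths vary, here is a sketch of how EUC-JP frames its characters: a lead byte of 0x8F introduces a three-byte JIS X 0212 character, while lead bytes from 0xA1 to 0xFE start a two-byte JIS X 0208 character. This assumes the lines really are EUC-JP, which the codec working on them suggests. The function names are hypothetical, and a real codec remains the better tool for the actual conversion.

```rust
/// Returns the byte length of the EUC-JP character that starts with `lead`,
/// or `None` if `lead` cannot start a character.
pub fn euc_jp_char_len(lead: u8) -> Option<usize> {
    match lead {
        0x00..=0x7F => Some(1), // ASCII
        0x8E => Some(2),        // half-width katakana (JIS X 0201)
        0x8F => Some(3),        // JIS X 0212
        0xA1..=0xFE => Some(2), // JIS X 0208
        _ => None,
    }
}

/// Walks a line and returns the byte length of each character in order.
/// Returns `None` on an invalid lead byte or a truncated character.
pub fn char_lengths(line: &[u8]) -> Option<Vec<usize>> {
    let mut lengths = Vec::new();
    let mut i = 0;
    while i < line.len() {
        let len = euc_jp_char_len(line[i])?;
        if i + len > line.len() {
            return None; // character runs past the end of the line
        }
        lengths.push(len);
        i += len;
    }
    Some(lengths)
}
```

This only checks lead bytes; it does not validate trailing bytes or map anything to Unicode.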

kradfile

Each two-byte pair is a JIS X 0208 character.

�� : �� �� ��
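A line in this shape can be split on the ASCII delimiters before any decoding takes place. The sketch below works on raw bytes and assumes the line terminator has already been stripped and that comment lines start with `#`; `parse_krad_line` is a hypothetical name, not this library's API.

```rust
/// Splits a raw decomposition line into the kanji bytes and a list of
/// radical byte sequences. Returns `None` for blank lines, comment lines,
/// and lines without a " : " separator.
/// Assumption: `line` has no trailing newline.
pub fn parse_krad_line(line: &[u8]) -> Option<(&[u8], Vec<&[u8]>)> {
    if line.is_empty() || line[0] == b'#' {
        return None;
    }
    // Find the ASCII " : " that separates the kanji from its radicals.
    let sep = line.windows(3).position(|w| w == b" : ")?;
    let kanji = &line[..sep];
    let radicals: Vec<&[u8]> = line[sep + 3..]
        .split(|&b| b == b' ')
        .filter(|r| !r.is_empty())
        .collect();
    Some((kanji, radicals))
}
```

The returned byte slices are still in their original encoding and need to be decoded separately.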

kradfile2

The first three bytes, the kanji, are from JIS X 0212. The remaining two-byte radical characters are in JIS X 0208.

��� : �� �� ��

radkfile and radkfile2

The two bytes identifying the radical are in JIS X 0208. Glyph alternate representations are four characters from the set [0-9A-Z]. They give a JIS X 0212 character code in hexadecimal, but unlike the other JIS X 0212 characters in these files, they cannot be decoded with the EUC-JP codec. You must use the JIS X 0212 to Unicode conversion table instead. If the alternate representation does not match the hexadecimal form, it names an image from the WWWJDIC server, for example js02.

# No alternate representation
$ �� 1
# Glyph alternate representation
$ �� 3 4A6D
# Image alternate representation
$ �� 2 js01
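The three header forms above can be told apart by looking at the optional fourth field. The following is a sketch under a few assumptions: fields are separated by single ASCII spaces, the line terminator has been stripped, and a glyph code is exactly four characters drawn from 0-9 and A-F. The types and the `parse_header` function are hypothetical names for illustration.

```rust
/// The optional alternate representation on a radical header line.
#[derive(Debug, PartialEq)]
pub enum Alternate {
    /// No fourth field was present.
    Absent,
    /// A four-digit hexadecimal JIS X 0212 code, e.g. 0x4A6D.
    Glyph(u16),
    /// Anything else, taken to be an image name such as "js01".
    Image(String),
}

#[derive(Debug, PartialEq)]
pub struct RadicalHeader<'a> {
    /// Raw radical bytes, still in their original encoding.
    pub radical: &'a [u8],
    pub strokes: u8,
    pub alternate: Alternate,
}

/// Parses a `$ <radical> <strokes> [alternate]` line.
/// Returns `None` if the line is not a well-formed header.
pub fn parse_header(line: &[u8]) -> Option<RadicalHeader<'_>> {
    let mut fields = line.split(|&b| b == b' ').filter(|f| !f.is_empty());
    if fields.next()? != b"$" {
        return None;
    }
    let radical = fields.next()?;
    let strokes: u8 = std::str::from_utf8(fields.next()?).ok()?.parse().ok()?;
    let alternate = match fields.next() {
        None => Alternate::Absent,
        Some(raw) => {
            let text = std::str::from_utf8(raw).ok()?;
            let is_hex = text.len() == 4
                && text.bytes().all(|b| matches!(b, b'0'..=b'9' | b'A'..=b'F'));
            if is_hex {
                Alternate::Glyph(u16::from_str_radix(text, 16).ok()?)
            } else {
                Alternate::Image(text.to_string())
            }
        }
    };
    Some(RadicalHeader { radical, strokes, alternate })
}
```

The `Glyph` value is only the numeric code; looking it up in a JIS X 0212 to Unicode table is a separate step.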

The lines following radical identifiers are in JIS X 0212.

�����������������...