pinyin-sort

A small Rust CLI that sorts Chinese strings by their Hanyu Pinyin (tone3) order, with sensible tie‑breaking by the original character and flexible output formatting. It can read input from files or directly from command‑line text arguments. A simple TOML override file lets you correct or customize pinyin for specific characters or phrases.

Note

This repository generates a large static map from codepoint to pinyin at build time from the vendored pinyin-data source.
Pinyin syllables are normalized to tone3 style (e.g., han4, zhao4).

Features

Sort a list of Chinese strings by pinyin
Deterministic tie‑breaking by original character when pinyin matches
Accept input via files or inline text
Highly configurable output formatting (columns, alignment, padding, separators, blank line cadence)
Optional pinyin override file (TOML) for characters and phrases
Reproducible development environment with Nix and Just tasks

Installation

From source (Cargo)

Prerequisites:

Rust toolchain (cargo + rustc)

Steps:

Prepare data (convert vendored pinyin list to CSV):
- If you do NOT use Nix:
  - Ensure you have Python 3 and the pypinyin package installed: pip install pypinyin
  - Run: python3 scripts/convert_pinyin_to_csv.py
- If you use Nix: see the Nix section below or simply run: just prep-data
Build:
- cargo build --release
The binary will be at:
- target/release/pinyin-sort

With Nix

This repo includes a flake and a development shell.

Enter the dev shell (provides rustup, cargo, python, pypinyin, just):
- nix develop
Prepare data:
- just prep-data
Build (using Nix):
- just build
- or nix build
Binary location when building via Cargo inside the dev shell:
- target/release/pinyin-sort

Usage

Basic help:

pinyin-sort -h

Inputs are provided either as files or inline text. If neither is provided, the tool prints its help and exits.

Examples:

Sort three inline strings and print as a table with defaults:
- pinyin-sort -t 汉字 -t 张三 -t 赵四
Sort lines from a file:
- pinyin-sort -f ./data.txt
Sort multiple files and override output layout:
- pinyin-sort -f a.txt b.txt --columns 5 --entry-width 6 --align center --separator ","

Behavior overview:

The program converts each string to a vector of pinyin syllables (tone3). It compares the first pinyin of each character in order. If syllables at a position are equal, it falls back to comparing the original characters so that, for example, 赵 sorts after 照 when pronunciations match. If all compared syllables match, shorter strings come first.
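The comparison rule above can be sketched as follows. This is a minimal illustration, not the crate's actual code: the `Keyed` type and the `compare_entries` and `key` helpers are hypothetical names, and the pinyin syllables are assumed to be compared as plain strings.

```rust
use std::cmp::Ordering;

/// One character paired with the pinyin reading used for sorting.
/// Hypothetical type for this sketch; not taken from the crate.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Keyed {
    pub ch: char,
    pub pinyin: String, // tone3, e.g. "zhao4"
}

/// Compare two entries position by position: pinyin first, then the
/// original character. If every shared position ties, the shorter
/// entry sorts first.
pub fn compare_entries(a: &[Keyed], b: &[Keyed]) -> Ordering {
    for (x, y) in a.iter().zip(b.iter()) {
        let ord = x.pinyin.cmp(&y.pinyin).then(x.ch.cmp(&y.ch));
        if ord != Ordering::Equal {
            return ord;
        }
    }
    a.len().cmp(&b.len())
}

/// Build a sort key from (char, pinyin) pairs supplied by the caller.
pub fn key(pairs: &[(char, &str)]) -> Vec<Keyed> {
    pairs
        .iter()
        .map(|&(ch, p)| Keyed { ch, pinyin: p.to_string() })
        .collect()
}
```

With this sketch, a list of keys can be ordered with `keys.sort_by(|a, b| compare_entries(a, b))`. Because the character tie-break compares `char` values, entries with identical pinyin always come out in the same order.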

Exit codes:

0 on success
Non‑zero on I/O or configuration parsing errors (e.g., reading files, loading override TOML)

CLI options

These options are defined in src/args.rs and parsed via clap.

Inputs

-f, --file Input file path (can be passed multiple times)
-t, --text ... Inline text data (can be passed multiple times)

Output destination

Output currently goes to stdout only. The -o/--output flag is defined in the CLI arguments but is not yet wired up, so use shell redirection for now.

Pinyin overrides

-c, --config TOML file with override rules (see below)

Formatting

--columns Number of entries per row (default: 6)
--blank-every Insert a blank line every N rows (default: 7)
--entry-width Pad each entry to this width (default: 4)
--align One of: left, center, right, even (default: center)
--padding-char Character for padding (default: space)
--separator Entry separator (default: tab)
--line-ending Line ending (default: \n)

Note: When passing characters such as tab or newline on the command line, quote or escape them appropriately for your shell.
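To show how the formatting options interact, here is a rough sketch of cell padding and row layout. It is not the code in src/format.rs: `Align`, `pad_cell`, and `layout` are hypothetical names, width is assumed to be counted in `char`s rather than terminal columns, and the `even` alignment is omitted because its behavior is not described here.

```rust
/// Horizontal alignment of an entry inside its cell.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Align {
    Left,
    Center,
    Right,
}

/// Pad `entry` to `width` characters using `pad`.
/// Entries already at or beyond `width` are returned unchanged.
pub fn pad_cell(entry: &str, width: usize, align: Align, pad: char) -> String {
    let len = entry.chars().count();
    let total = width.saturating_sub(len);
    let (left, right) = match align {
        Align::Left => (0, total),
        Align::Right => (total, 0),
        // Any odd leftover padding character goes to the right side.
        Align::Center => (total / 2, total - total / 2),
    };
    let mut out = String::new();
    out.extend(std::iter::repeat(pad).take(left));
    out.push_str(entry);
    out.extend(std::iter::repeat(pad).take(right));
    out
}

/// Lay out cells in rows of `columns`, joined by `separator`, and
/// insert an empty line after every `blank_every` rows
/// (0 disables the blank lines in this sketch).
pub fn layout(
    cells: &[String],
    columns: usize,
    blank_every: usize,
    separator: &str,
    line_ending: &str,
) -> String {
    let mut out = String::new();
    for (row_idx, row) in cells.chunks(columns.max(1)).enumerate() {
        if blank_every > 0 && row_idx > 0 && row_idx % blank_every == 0 {
            out.push_str(line_ending);
        }
        out.push_str(&row.join(separator));
        out.push_str(line_ending);
    }
    out
}
```

Counting `char`s keeps the sketch simple, but CJK characters usually occupy two terminal columns, so a real implementation may need a display-width calculation instead.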

Override configuration (TOML)

You can customize pinyin for specific characters or phrases. Provide a TOML file via --config.

Schema (see src/override.rs):

char_override: map from single char to a single pinyin string
phrase_override: map from full phrase (string) to an array of pinyin strings, one per character

Example override.toml:

[char_override]
'重' = "chong2"
'行' = "xing2"

[phrase_override]
"重庆" = ["chong2", "qing4"]
"银行" = ["yin2", "hang2"]

Usage:

pinyin-sort -t 重庆 -t 重庆市 --config ./override.toml

Notes:

phrase_override takes precedence when the full input matches a phrase key.
For characters not listed in the overrides, built‑in data is used.
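The precedence described in these notes can be sketched as below. This is an assumption-laden illustration rather than the code in src/override.rs: the `Overrides` struct and `resolve` function are hypothetical, the `builtin` closure stands in for the generated lookup table, and falling back to the character itself when no reading is known is a choice made only for this sketch.

```rust
use std::collections::HashMap;

/// In-memory form of the two override tables. Field names mirror the
/// TOML keys; the struct itself is a stand-in for this sketch.
#[derive(Debug, Default)]
pub struct Overrides {
    pub char_override: HashMap<char, String>,
    pub phrase_override: HashMap<String, Vec<String>>,
}

/// Resolve pinyin for `input`, one syllable per character.
/// `builtin` stands in for the generated codepoint-to-pinyin table.
pub fn resolve(
    input: &str,
    ov: &Overrides,
    builtin: &dyn Fn(char) -> Option<&'static str>,
) -> Vec<String> {
    // 1. A phrase override applies only when the whole input matches.
    if let Some(phrase) = ov.phrase_override.get(input) {
        return phrase.clone();
    }
    input
        .chars()
        .map(|c| {
            // 2. Per-character override, 3. built-in data,
            // 4. (sketch-only assumption) the character itself.
            ov.char_override
                .get(&c)
                .cloned()
                .or_else(|| builtin(c).map(str::to_string))
                .unwrap_or_else(|| c.to_string())
        })
        .collect()
}
```

Note that under this reading of the rules, a phrase key such as "重庆" does not affect the longer input "重庆市"; only the character overrides and built-in data apply there.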

Data and build process

Generated file:

src/generated/pinyin_map.rs is generated at build time by build.rs from data/pinyin.csv using phf (perfect hash function) for fast lookups.

Data preparation:

The project vendors the pinyin-data source list at vendor/pinyin-data/pinyin.txt.
scripts/convert_pinyin_to_csv.py transforms the vendored pinyin data into data/pinyin.csv and normalizes to tone3 using pypinyin.
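For readers unfamiliar with tone3, the following sketch shows what the normalization does to a single syllable. It is written in Rust purely for illustration; the project's script delegates this step to pypinyin, and `split_tone` and `to_tone3` are hypothetical names. The sketch covers only the six standard vowels, keeps ü as-is, and leaves neutral-tone syllables without a number; rarer forms such as tone-marked m or n are not handled.

```rust
/// Map a tone-marked vowel to (plain vowel, tone number 1..=4).
/// Returns None for characters that carry no tone mark.
fn split_tone(c: char) -> Option<(&'static str, u8)> {
    const TABLE: [(&str, &str); 6] = [
        ("a", "āáǎà"),
        ("e", "ēéěè"),
        ("i", "īíǐì"),
        ("o", "ōóǒò"),
        ("u", "ūúǔù"),
        ("ü", "ǖǘǚǜ"),
    ];
    for &(base, marked) in TABLE.iter() {
        if let Some(i) = marked.chars().position(|m| m == c) {
            return Some((base, i as u8 + 1));
        }
    }
    None
}

/// Convert a tone-marked syllable such as "hàn" to tone3 ("han4"):
/// strip the diacritic and append the tone number at the end.
pub fn to_tone3(syllable: &str) -> String {
    let mut out = String::new();
    let mut tone = None;
    for c in syllable.chars() {
        match split_tone(c) {
            Some((base, t)) => {
                out.push_str(base);
                tone = Some(t);
            }
            None => out.push(c),
        }
    }
    if let Some(t) = tone {
        out.push_str(&t.to_string());
    }
    out
}
```

Placing the tone digit at the end keeps syllables as plain ASCII-friendly strings, which makes them straightforward to compare during sorting.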

Build steps:

Ensure data/pinyin.csv exists (create it via the script above).
Run cargo build (or cargo build --release). The build script regenerates src/generated/pinyin_map.rs when data/pinyin.csv changes.

Programmatic use

The library code includes simple helpers:

pinyin::pinyin_of(&str) -> Vec
sort::sort_by_pinyin(Vec<T: ToString>) -> Vec
format::{format, format_cell, FormatConfig}

Caveats:

The pinyin_of function relies on generated data and optional overrides. It returns per‑character pinyin (first reading or the override for the position).

Development

Tests: run cargo test
Dev shell (Nix): nix develop
Just recipes: just prep-data, just build

License

AGPL-3.0-only. See Cargo.toml.

Acknowledgements

pinyin-data (https://github.com/mozillazg/pinyin-data) for the source data under vendor/pinyin-data.
pypinyin for tone conversion in the preprocessing script.
phf for fast compile‑time maps.
