Hanzi Char Featurizer

Extracts multi-dimensional features from Chinese characters for deep learning applications: phonetic features (pinyin), glyph features (four-corner code), and structural features (component decomposition).

Feature Extractors

Extractor	Description	Example
PinYinParts	Pinyin decomposition (initial, final, tone)	`明` → `{m, ing, 2}`
FourCorner	Four-corner encoding (four corner digits plus one extra digit)	`明` → `{6, 7, 0, 2, 0}`
ChaiZi	Decomposition into component radicals	`明` → `(日, 月)`
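
To make the pinyin row of the table concrete, here is a minimal sketch of how a numbered-pinyin syllable such as `ming2` can be split into initial, final, and tone. The helper `split_pinyin` and the `INITIALS` table are hypothetical and written for illustration only; they are not part of hanzi_char_featurizer, and the library may handle edge cases (for example whether `y` and `w` count as initials) differently.

```python
# Hypothetical helper, not part of hanzi_char_featurizer.
# Two-letter initials come first so "zh" is matched before "z".
# Treating "y" and "w" as initials is a convention assumed here.
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w",
)


def split_pinyin(syllable: str) -> tuple[str, str, str]:
    """Split a numbered-pinyin syllable into (initial, final, tone)."""
    # A trailing digit is taken as the tone; otherwise the tone is "".
    tone = syllable[-1] if syllable[-1:].isdigit() else ""
    body = syllable[:-1] if tone else syllable
    for initial in INITIALS:
        if body.startswith(initial):
            return initial, body[len(initial):], tone
    # Zero-initial syllables such as "an" or "er".
    return "", body, tone


print(split_pinyin("ming2"))  # ('m', 'ing', '2')
print(split_pinyin("tian1"))  # ('t', 'ian', '1')
```

Trying the two-letter initials first avoids splitting `zhong1` as `z` + `hong`.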

Installation

pip install hanzi_char_featurizer

Quick Start

from hanzi_char_featurizer import Featurizer

featurizer = Featurizer()

# Extract features for every character in the string
result = featurizer.extract('明天')
print(result)

Output:

{
    'pinyin': {
        'initial': [['m'], ['t']],
        'final': [['ing'], ['ian']],
        'tone': [['2'], ['1']]
    },
    'four_corner': {
        'upper_left': ['6', '1'],
        'upper_right': ['7', '0'],
        'lower_left': ['0', '8'],
        'lower_right': ['2', '0'],
        'extra': ['0', '4']
    }
}
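
The output above is organized by feature, with one list entry per input character. When a per-character view is more convenient, the dict can be regrouped as sketched below. The function `per_character` is a hypothetical helper, not a library function, and the `result` literal is copied from the output shown above rather than produced by calling the library.

```python
# Hypothetical helper, not part of hanzi_char_featurizer.
# `result` is the example output from the Quick Start section.
result = {
    "pinyin": {
        "initial": [["m"], ["t"]],
        "final": [["ing"], ["ian"]],
        "tone": [["2"], ["1"]],
    },
    "four_corner": {
        "upper_left": ["6", "1"],
        "upper_right": ["7", "0"],
        "lower_left": ["0", "8"],
        "lower_right": ["2", "0"],
        "extra": ["0", "4"],
    },
}


def per_character(text, features):
    """Regroup feature-major output into one flat record per character."""
    records = []
    for i, char in enumerate(text):
        record = {"char": char}
        for group, fields in features.items():
            for field, values in fields.items():
                # Keys look like "pinyin.initial" or "four_corner.extra".
                record[f"{group}.{field}"] = values[i]
        records.append(record)
    return records


records = per_character("明天", result)
print(records[0]["pinyin.initial"])     # ['m']
print(records[1]["four_corner.extra"])  # 4
```

The pinyin values stay wrapped in an inner list, exactly as in the output above; presumably this leaves room for characters with more than one reading, though that is an assumption.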

API

# Extract features (returns a dict)
result = featurizer.extract('明天')

# Extract features (returns NumPy arrays)
result = featurizer.extract('明天', as_numpy=True)

# Get the vocabulary
vocab = featurizer.vocabulary
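
The structure of `featurizer.vocabulary` and of the NumPy output is not documented in this README. As a rough illustration of why a vocabulary is useful for deep learning, the sketch below maps feature symbols to integer ids that an embedding layer could consume. `build_vocab`, `encode`, and the `<unk>` token are hypothetical names chosen for this example and make no claim about how the library represents its vocabulary.

```python
# Hypothetical sketch; the real `vocabulary` attribute may look different.
UNKNOWN = "<unk>"


def build_vocab(symbols):
    """Assign an integer id to each distinct symbol; id 0 means unknown."""
    vocab = {UNKNOWN: 0}
    for symbol in symbols:
        # setdefault keeps the first id assigned to a repeated symbol.
        vocab.setdefault(symbol, len(vocab))
    return vocab


def encode(symbols, vocab):
    """Replace each symbol with its id, falling back to the unknown id."""
    return [vocab.get(symbol, vocab[UNKNOWN]) for symbol in symbols]


# The 'upper_left' column from the Quick Start output for '明天'.
upper_left = ["6", "1"]
vocab = build_vocab(upper_left)
print(vocab)                           # {'<unk>': 0, '6': 1, '1': 2}
print(encode(["1", "6", "9"], vocab))  # [2, 1, 0]
```

The resulting id list can be passed to `numpy.asarray` or a framework tensor constructor when array input is needed.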

Using Individual Extractors

from hanzi_char_featurizer.featurizers.four_corner import FourCorner
from hanzi_char_featurizer.featurizers.pinyin_parts import PinYinParts
from hanzi_char_featurizer.featurizers.chaizi import ChaiZi

fc = FourCorner()
fc.extract('明')  # {'upper_left': ['6'], 'upper_right': ['7'], ...}

pp = PinYinParts()
pp.extract('明')  # {'initial': [['m']], 'final': [['ing']], 'tone': [['2']]}

cz = ChaiZi()
cz.extract('明')  # {'components': [('日', '月')]}
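
The field names returned by `FourCorner` correspond to the positions of a four-corner code: four corner digits followed by one extra digit, so `明` (code `6702.0`) yields `6, 7, 0, 2, 0`. The sketch below shows that mapping for a code string. `split_four_corner` is a hypothetical helper for illustration, not a library function; only the field names are taken from the output shown in this README.

```python
# Hypothetical helper, not part of hanzi_char_featurizer.
# Field names follow the keys in the FourCorner output above.
FIELDS = ("upper_left", "upper_right", "lower_left", "lower_right", "extra")


def split_four_corner(code: str) -> dict[str, str]:
    """Map a four-corner code such as '6702.0' to named digit fields."""
    # Codes are often written with a dot before the fifth (extra) digit.
    digits = code.replace(".", "")
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"expected five digits, got {code!r}")
    return dict(zip(FIELDS, digits))


print(split_four_corner("6702.0"))
# {'upper_left': '6', 'upper_right': '7', 'lower_left': '0',
#  'lower_right': '2', 'extra': '0'}
```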

Custom Extractor Combination

from hanzi_char_featurizer import Featurizer
from hanzi_char_featurizer.featurizers.four_corner import FourCorner

# Use only the four-corner encoding
featurizer = Featurizer(featurizers=[FourCorner()])
result = featurizer.extract('明天')
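
One way to picture the combination is a wrapper that calls each extractor and files its output under that extractor's key. The sketch below is a design illustration under stated assumptions, not the library's implementation: `ComposedFeaturizer`, `StubFourCorner`, `StubTone`, and the `name` attribute are all hypothetical, and the lookup tables contain only the two characters used in this README.

```python
# Hypothetical stand-ins; none of these classes exist in the library.
class StubFourCorner:
    """Toy extractor backed by a two-character lookup table."""

    name = "four_corner"
    _fields = ("upper_left", "upper_right", "lower_left", "lower_right", "extra")
    _codes = {"明": "67020", "天": "10804"}  # digits from the Quick Start output

    def extract(self, text):
        out = {field: [] for field in self._fields}
        for char in text:
            for field, digit in zip(self._fields, self._codes[char]):
                out[field].append(digit)
        return out


class StubTone:
    """Toy extractor that returns only the tone of each character."""

    name = "tone"
    _tones = {"明": "2", "天": "1"}

    def extract(self, text):
        return {"tone": [[self._tones[char]] for char in text]}


class ComposedFeaturizer:
    """Run every extractor and key its output by the extractor's name."""

    def __init__(self, featurizers):
        self.featurizers = list(featurizers)

    def extract(self, text):
        return {f.name: f.extract(text) for f in self.featurizers}


only_fc = ComposedFeaturizer([StubFourCorner()])
print(only_fc.extract("明天")["four_corner"]["upper_left"])  # ['6', '1']

both = ComposedFeaturizer([StubFourCorner(), StubTone()])
print(sorted(both.extract("明天")))  # ['four_corner', 'tone']
```

Keeping each extractor behind a common `extract(text)` method means adding or dropping a feature family only changes the list passed to the constructor.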

Companies Using This

TODO

Add the Unicode IDS (Ideographic Description Sequence) representation, as used in iQIYI's FASPell model.

License

MIT

howl-anderson/hanzi_char_featurizer