Extract multi-dimensional features from Chinese characters for deep learning applications: phonetic features, glyph features, and structural features.

| Extractor | Description | Example |
|---|---|---|
| `PinYinParts` | Pinyin decomposition (initial, final, tone) | 明 → {m, ing, 2} |
| `FourCorner` | Four-corner encoding | 明 → {6, 7, 0, 2, 0} |
| `ChaiZi` | Radical decomposition | 明 → (日, 月) |
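
To make the pinyin row of the table concrete, here is a minimal sketch of splitting a tone-numbered syllable into initial, final, and tone. `split_pinyin` and `INITIALS` are hypothetical names used only for illustration; this is not how `PinYinParts` is implemented. It also assumes that `y` and `w` count as initials, which is one common convention.

```python
from typing import Tuple

# Hypothetical helper for illustration; not part of hanzi_char_featurizer.
# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")


def split_pinyin(syllable: str) -> Tuple[str, str, str]:
    """Split a syllable such as 'ming2' into (initial, final, tone)."""
    tone = ""
    # A trailing digit is treated as the tone number.
    if syllable and syllable[-1].isdigit():
        syllable, tone = syllable[:-1], syllable[-1]
    # The first matching initial wins; the remainder is the final.
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):], tone
    # Syllables such as 'an' or 'er' have no initial.
    return "", syllable, tone


print(split_pinyin("ming2"))  # ('m', 'ing', '2')
print(split_pinyin("tian1"))  # ('t', 'ian', '1')
```

The two printed tuples correspond to 明 (ming2) and 天 (tian1).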

Install:

    pip install hanzi_char_featurizer

Basic usage:

    from hanzi_char_featurizer import Featurizer

    featurizer = Featurizer()

    # Extract features
    result = featurizer.extract('明天')
    print(result)

Output:

    {
        'pinyin': {
            'initial': [['m'], ['t']],
            'final': [['ing'], ['ian']],
            'tone': [['2'], ['1']]
        },
        'four_corner': {
            'upper_left': ['6', '1'],
            'upper_right': ['7', '0'],
            'lower_left': ['0', '8'],
            'lower_right': ['2', '0'],
            'extra': ['0', '4']
        }
    }
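
For readers who want to see how symbolic features like those in the output above can feed an embedding layer, the following is a hand-rolled sketch that maps each symbol to an integer id. `build_vocab` and `encode` are hypothetical helpers written for this illustration; they are not the library's API and may not match how the library builds its own vocabulary.

```python
from typing import Dict, List

# The four_corner part of the example output, copied as a literal.
four_corner = {
    "upper_left": ["6", "1"],
    "upper_right": ["7", "0"],
    "lower_left": ["0", "8"],
    "lower_right": ["2", "0"],
    "extra": ["0", "4"],
}


def build_vocab(symbols: List[str]) -> Dict[str, int]:
    """Assign ids in sorted order, reserving 0 for padding or unknown symbols."""
    return {sym: idx for idx, sym in enumerate(sorted(set(symbols)), start=1)}


def encode(feature: Dict[str, List[str]], vocab: Dict[str, int]) -> List[List[int]]:
    """Return one row of ids per character and one column per feature key."""
    keys = list(feature)
    length = len(feature[keys[0]])
    return [[vocab.get(feature[k][i], 0) for k in keys] for i in range(length)]


all_symbols = [s for values in four_corner.values() for s in values]
vocab = build_vocab(all_symbols)
ids = encode(four_corner, vocab)
print(ids)  # [[5, 6, 1, 3, 1], [2, 1, 7, 1, 4]]
```

Each row holds the five four-corner positions of one character, so the result can be stacked into a `(sequence_length, 5)` integer array.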

`extract` returns a dict by default, or NumPy arrays with `as_numpy=True`; the vocabulary is available as an attribute:

    # Extract features (returns dict)
    result = featurizer.extract('明天')

    # Extract features (returns NumPy arrays)
    result = featurizer.extract('明天', as_numpy=True)

    # Get vocabulary
    vocab = featurizer.vocabulary

Each extractor can also be used on its own:

    from hanzi_char_featurizer.featurizers.four_corner import FourCorner
    from hanzi_char_featurizer.featurizers.pinyin_parts import PinYinParts
    from hanzi_char_featurizer.featurizers.chaizi import ChaiZi

    fc = FourCorner()
    fc.extract('明')  # {'upper_left': ['6'], 'upper_right': ['7'], ...}

    pp = PinYinParts()
    pp.extract('明')  # {'initial': [['m']], 'final': [['ing']], 'tone': [['2']]}

    cz = ChaiZi()
    cz.extract('明')  # {'components': [('日', '月')]}

To run only a subset of extractors, pass them to `Featurizer`:

    from hanzi_char_featurizer import Featurizer
    from hanzi_char_featurizer.featurizers.four_corner import FourCorner

    # Use only four-corner encoding
    featurizer = Featurizer(featurizers=[FourCorner()])
    result = featurizer.extract('明天')

TODO:

- Add Unicode IDS representation from iQIYI's FASPell model
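
As background on the IDS item, a Unicode Ideographic Description Sequence writes a character as a prefix expression, for example `⿰日月` for 明 (日 on the left, 月 on the right). The sketch below parses such a string into a nested tree. `parse_ids` is a hypothetical helper that covers only the twelve operators in U+2FF0..U+2FFB; it shows the notation itself, not FASPell's data format or any planned API of this package.

```python
# Ideographic Description Characters U+2FF0..U+2FFB and their operand counts.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # ⿲ left, middle, right
IDC_ARITY["\u2FF3"] = 3  # ⿳ top, middle, bottom


def parse_ids(ids: str):
    """Parse a prefix-notation IDS string into a nested (operator, operands) tree."""

    def parse(pos: int):
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            return ch, pos  # a leaf component
        operands = []
        for _ in range(IDC_ARITY[ch]):
            node, pos = parse(pos)
            operands.append(node)
        return (ch, operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters in IDS string")
    return tree


print(parse_ids("⿰日月"))      # ('⿰', ['日', '月'])
print(parse_ids("⿱⿰日月皿"))  # ('⿱', [('⿰', ['日', '月']), '皿'])
```

The second call decomposes 盟 as 明 over 皿, with 明 itself expanded into 日 and 月.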

License: MIT