
Commit 79f27f8

Merge pull request #176 from megagonlabs/develop
Release v5.0.0
2 parents 3c31881 + 7a138b7 commit 79f27f8

29 files changed: +3031 −315 lines

.gitignore

+14 −5
````diff
@@ -1,14 +1,23 @@
 /bccwj*/
+/build/
+/config/ja_gsd*
+/corpus*/
+/dist/
+/electra*
 /embedding*/
-/ja_ginza/resources/system_core.dic
-/kwdlc*/
+/ja_*
+/log*
+/megagonlabs/
 /models/
-/nopn*/
 /old/
+/rtx*
 /submodules/
+/sudachi*
 /target/
 /test/
-/log*
+/vector*
+/venv*
 __pycache__/
 *.pyc
-
+*.egg-info/
+.DS_Store
````

README.md

+86 −24
````diff
@@ -7,27 +7,32 @@
 
 An Open Source Japanese NLP Library, based on Universal Dependencies
 
-***Please read the [Important changes](#ginza-400) before you upgrade GiNZA.***
+***Please read the [Important changes](#ginza-500) before you upgrade GiNZA.***
 
 ## License
-GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models are distributed under
-[The MIT License](https://github.com/megagonlabs/ginza/blob/master/LICENSE).
-You must agree and follow The MIT License to use GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models.
+GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models are distributed under the
+[MIT License](https://github.com/megagonlabs/ginza/blob/master/LICENSE).
+You must agree and follow the MIT License to use GiNZA NLP Library and GiNZA Japanese Universal Dependencies Models.
 
-### spaCy
+### Explosion / spaCy
 spaCy is the key framework of GiNZA.
+
 [spaCy LICENSE PAGE](https://github.com/explosion/spaCy/blob/master/LICENSE)
 
-### Sudachi/SudachiPy - SudachiDict - chiVe
+### Works Applications Enterprise / Sudachi/SudachiPy - SudachiDict - chiVe
 SudachiPy provides high accuracies for tokenization and pos tagging.
+
 [Sudachi LICENSE PAGE](https://github.com/WorksApplications/Sudachi/blob/develop/LICENSE-2.0.txt),
-[SudachiPy LICENSE PAGE](https://github.com/WorksApplications/SudachiPy/blob/develop/LICENSE)
+[SudachiPy LICENSE PAGE](https://github.com/WorksApplications/SudachiPy/blob/develop/LICENSE),
+[SudachiDict LEGAL PAGE](https://github.com/WorksApplications/SudachiDict/blob/develop/LEGAL),
+[chiVe LICENSE PAGE](https://github.com/WorksApplications/chiVe/blob/master/LICENSE)
 
-[SudachiDict LEGAL PAGE](https://github.com/WorksApplications/SudachiDict/blob/develop/LEGAL)
+### Hugging Face / transformers
+The GiNZA v5 Transformers model (ja_ginza_electra) is trained by using Hugging Face Transformers as a framework for pretrained models.
 
-[chiVe LICENSE PAGE](https://github.com/WorksApplications/chiVe/blob/master/LICENSE)
+[transformers LICENSE PAGE](https://github.com/huggingface/transformers/blob/master/LICENSE)
 
-## Training Data-sets
+## Training Datasets
 
 ### UD Japanese BCCWJ v2.6
 The parsing model of GiNZA v4 is trained on a part of
````
````diff
@@ -44,26 +49,62 @@ We use two of the named entity label systems, both
 and extended [OntoNotes5](https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf).
 This model is developed by National Institute for Japanese Language and Linguistics, and Megagon Labs.
 
+### mC4
+The GiNZA v5 Transformers model (ja-ginza-electra) is trained by using [transformers-ud-japanese-electra-base-discriminator](https://huggingface.co/megagonlabs/transformers-ud-japanese-electra-base-discriminator), which is pretrained on more than 200 million Japanese sentences extracted from [mC4](https://huggingface.co/datasets/mc4).
+
+Contains information from mC4 which is made available under the ODC Attribution License.
+```
+@article{2019t5,
+  author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
+  title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
+  journal = {arXiv e-prints},
+  year = {2019},
+  archivePrefix = {arXiv},
+  eprint = {1910.10683},
+}
+```
 
 ## Runtime Environment
 This project is developed with Python>=3.6 and pip for it.
 We do not recommend to use Anaconda environment because the pip install step may not work properly.
-(We'd like to support Anaconda in near future.)
 
 Please also see the Development Environment section below.
 ### Runtime set up
-#### 1. Install GiNZA NLP Library with Japanese Universal Dependencies Model
-Run following line
+
+#### 1. Install GiNZA NLP Library with Transformer-based Model
+Uninstall previous version:
+```console
+$ pip uninstall ginza ja-ginza
+```
+Then, install the latest version of `ginza` and `ja-ginza-electra`:
+```console
+$ pip install -U ginza ja-ginza-electra
+```
+
+The package of `ja-ginza-electra` does not include `pytorch_model.bin` due to PyPI's archive size restrictions.
+This large model file will be automatically downloaded at the first run, and the locally cached file will be used for subsequent runs.
+
+If you need to install `ja-ginza-electra` along with `pytorch_model.bin` at install time, you can specify the direct link to the GitHub release archive as follows:
 ```console
-$ pip install -U ginza
+$ pip install -U ginza https://github.com/megagonlabs/ginza/releases/download/latest/ja_ginza_electra-latest-with-model.tar.gz
 ```
 
-If you encountered some install problems related to Cython, please try to set the CFLAGS like below.
+If you want to accelerate the transformers-based models with CUDA-capable GPUs, you can install `spacy` by specifying the CUDA version as follows:
 ```console
-$ CFLAGS='-stdlib=libc++' pip install ginza
+$ pip install -U "spacy[cuda110]"
 ```
 
-#### 2. Execute ginza from command line
+#### 2. Install GiNZA NLP Library with Standard Model
+Uninstall previous version:
+```console
+$ pip uninstall ginza ja-ginza
+```
+Then, install the latest version of `ginza` and `ja-ginza`:
+```console
+$ pip install -U ginza ja-ginza
+```
+
+### Execute ginza command
 Run `ginza` command from the console, then input some Japanese text.
 After pressing the enter key, you will get the parsed results in [CoNLL-U Syntactic Annotation](https://universaldependencies.org/format.html#syntactic-annotation) format.
 ```console
````
````diff
@@ -76,13 +117,13 @@ $ ginza
 4	を	を	ADP	助詞-格助詞	_	3	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
 5	ご	ご	NOUN	接頭辞	_	6	compound	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
 6	一緒	一緒	VERB	名詞-普通名詞-サ変可能	_	0	root	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
-7	し	する	AUX	動詞-非自立可能	_	6	advcl	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
+7	し	する	AUX	動詞-非自立可能	_	6	aux	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
 8	ましょう	ます	AUX	助動詞	_	6	aux	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
 9	。	。	PUNCT	補助記号-句点	_	6	punct	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。
 
 ```
 `ginzame` command provides tokenization function like [MeCab](https://taku910.github.io/mecab/).
-The output format of `ginzame` is almost same as `mecab`, but the last `pronounciation` field is always '*'.
+The output format of `ginzame` is almost the same as `mecab`, but the last `pronunciation` field is always '*'.
 ```console
 $ ginzame
 銀座でランチをご一緒しましょう。
````
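The CoNLL-U lines in the hunk above are ten tab-separated columns, with `|`-separated key-value pairs in the MISC column. A minimal sketch in plain Python of pulling one token's fields apart (the sample line is abridged from the output above; this is not a GiNZA API):

```python
# Split one CoNLL-U token line into its columns and parse the MISC key-value pairs.
line = ("8\tましょう\tます\tAUX\t助動詞\t_\t6\taux\t_\t"
        "SpaceAfter=No|BunsetuBILabel=I|Reading=マショウ")
cols = line.split("\t")  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
form, lemma, upos = cols[1], cols[2], cols[3]
head, deprel = cols[6], cols[7]
misc = dict(kv.split("=", 1) for kv in cols[9].split("|"))
print(form, lemma, upos, head, deprel, misc["Reading"])
# → ましょう ます AUX 6 aux マショウ
```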
````diff
@@ -159,18 +200,14 @@ The memory requirement is about 130MB/process (to be improved).
 The following steps show dependency parsing results with the sentence boundary 'EOS'.
 ```python
 import spacy
-nlp = spacy.load('ja_ginza')
+nlp = spacy.load('ja_ginza_electra')
 doc = nlp('銀座でランチをご一緒しましょう。')
 for sent in doc.sents:
     for token in sent:
         print(token.i, token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.i)
     print('EOS')
 ```
 
-### APIs
-Please see [spaCy API documents](https://spacy.io/api/) for general analyzing functions.
-Or please refer the source codes of GiNZA on github until we'd write the documents.
-
 ### User Dictionary
 The user dictionary files should be set to the `userDict` field of `sudachi.json` in the installed package directory of the `ja_ginza_dict` package.
 
````
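The `userDict` field mentioned in the hunk above is, in Sudachi's settings format, a JSON array of paths to compiled user dictionary files. A minimal sketch of extending that fragment of `sudachi.json` (the file name `my_user.dic` is hypothetical; a real settings file carries many more keys):

```python
import json

# Hypothetical sudachi.json fragment with an empty userDict list.
settings = json.loads('{"userDict": []}')
# Register a user dictionary compiled with the sudachipy tooling (path is illustrative).
settings["userDict"].append("my_user.dic")
print(json.dumps(settings, ensure_ascii=False))  # → {"userDict": ["my_user.dic"]}
```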
````diff
@@ -179,6 +216,31 @@ Please read the official documents to compile user dictionaries with `sudachipy`
 [Sudachi User Dictionary Construction (Japanese Only)](https://github.com/WorksApplications/Sudachi/blob/develop/docs/user_dict.md)
 
 ## Releases
+
+### version 5.x
+
+#### ginza-5.0.0
+- 2021-08-26, Demantoid
+- Important changes
+  - Upgrade spaCy to v3
+  - Release transformer-based `ja-ginza-electra` model
+  - Improve UPOS accuracy of the standard `ja-ginza` model by adding `morphologizer` to the tail of the spaCy pipeline
+  - Need to install an analysis model along with the `ginza` package
+    - High accuracy model (>=16GB memory needed)
+      - `pip install -U ginza ja-ginza-electra`
+    - Speed oriented model
+      - `pip install -U ginza ja-ginza`
+  - Change component names of `CompoundSplitter` and `BunsetuRecognizer` to `compound_splitter` and `bunsetu_recognizer` respectively
+    - Also see [spaCy v3 Backwards Incompatibilities](https://spacy.io/usage/v3#incompat)
+- Improvements
+  - Add command line options
+    - `-n`
+      - Force using SudachiPy's `normalized_form` as `Token.lemma_`
+    - `-m (ja_ginza|ja_ginza_electra)`
+      - Select model package
+  - Revise ENE category name
+    - `Degital_Game` to `Digital_Game`
+
 ### version 4.x
 
 #### ginza-4.0.6
````
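The release notes above mention the `bunsetu_recognizer` component; its phrase boundaries surface in the CoNLL-U output as `BunsetuBILabel=B|I` markers in the MISC column. A minimal sketch in plain Python of collapsing those B/I labels into bunsetu chunks (the helper name and label lists are illustrative, taken from the `ginza` example output earlier; this is not a GiNZA API):

```python
def group_bunsetu(tokens, labels):
    """Group token surface forms into bunsetu chunks from B/I labels."""
    chunks = []
    for tok, label in zip(tokens, labels):
        if label == "B" or not chunks:
            chunks.append([tok])  # "B" starts a new bunsetu
        else:
            chunks[-1].append(tok)  # "I" continues the current one
    return ["".join(chunk) for chunk in chunks]

tokens = ["銀座", "で", "ランチ", "を", "ご", "一緒", "し", "ましょう", "。"]
labels = ["B", "I", "B", "I", "B", "I", "I", "I", "I"]
print(group_bunsetu(tokens, labels))  # → ['銀座で', 'ランチを', 'ご一緒しましょう。']
```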