Revise documentation
hankcs committed Jan 30, 2022
1 parent 00eaae9 commit 81ccd12
Showing 5 changed files with 175 additions and 61 deletions.
133 changes: 82 additions & 51 deletions README.md
<a href="https://hanlp.hankcs.com/docs/">Docs</a> |
<a href="https://bbs.hankcs.com/">Forum</a>
</h4>
The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing
state-of-the-art deep learning techniques in both academia and industry. HanLP was designed from day one to be
efficient, user-friendly and extendable.

Thanks to open-access corpora like Universal Dependencies and OntoNotes, HanLP 2.1 now offers 10 joint tasks on 104
languages: tokenization, lemmatization, part-of-speech tagging, token feature extraction, named entity recognition,
dependency parsing, constituency parsing, semantic role labeling, semantic dependency parsing, and abstract meaning
representation (AMR) parsing.

For end users, HanLP offers lightweight RESTful APIs and native Python APIs.

## RESTful APIs

Tiny packages of several KBs for agile development and mobile applications. Although anonymous users are welcome, an
auth key is recommended,
and [a free one can be applied for here](https://bbs.hankcs.com/t/apply-for-free-hanlp-restful-apis/3178) under
the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

<details>
<summary>Click to expand tutorials for RESTful APIs</summary>

### Python

```bash
pip install hanlp_restful
```

Create a client with our API endpoint and your auth.

```python
from hanlp_restful import HanLPClient
HanLP = HanLPClient('https://hanlp.hankcs.com/api', auth=None, language='mul') # mul: multilingual, zh: Chinese
```

### Java

Insert the following dependency into your `pom.xml`.

```xml
<dependency>
<groupId>com.hankcs.hanlp.restful</groupId>
<artifactId>hanlp-restful</artifactId>
<version>0.0.7</version>
</dependency>
```

Create a client with our API endpoint and your auth.

```java
HanLPClient HanLP = new HanLPClient("https://hanlp.hankcs.com/api", null, "mul"); // mul: multilingual, zh: Chinese
```

### Quick Start

No matter which language you use, the same interface can be used to parse a document.

```python
HanLP.parse(
"In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments. 2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。")
```

See [docs](https://hanlp.hankcs.com/docs/tutorial.html) for visualization, annotation guidelines and more details.
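The `parse` call above returns a dict-like `Document` keyed by task name, with one list per sentence. A minimal post-processing sketch, run on a hand-constructed sample instead of a live API response (the exact keys and tag values here are illustrative assumptions, not guaranteed output):

```python
# A hand-made sample shaped like a HanLPClient.parse() result
# (the "tok"/"pos" keys and the tags are assumptions for illustration).
doc = {
    "tok": [["In", "2021", ",", "HanLPv2.1", "delivers", "..."],
            ["2021年", "HanLPv2.1", "为", "生产", "环境", "..."]],
    "pos": [["ADP", "NUM", "PUNCT", "PROPN", "VERB", "X"],
            ["NOUN", "PROPN", "ADP", "NOUN", "NOUN", "X"]],
}

# Pair each token with its tag, sentence by sentence.
pairs = [list(zip(toks, tags)) for toks, tags in zip(doc["tok"], doc["pos"])]
print(pairs[0][3])  # ('HanLPv2.1', 'PROPN')
```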

</details>


## Native APIs

HanLP requires Python 3.6 or later. GPU/TPU is suggested but not mandatory.

```python
import hanlp

HanLP = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE)
print(HanLP(['In 2021, HanLPv2.1 delivers state-of-the-art multilingual NLP techniques to production environments.',
'2021年、HanLPv2.1は次世代の最先端多言語NLP技術を本番環境に導入します。',
'2021年 HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。']))
```

- In particular, the Python `HanLPClient` can also be used as a callable function with the same semantics.
  See [docs](https://hanlp.hankcs.com/docs/tutorial.html) for visualization, annotation guidelines and more details.
- To process Chinese or Japanese, HanLP provides mono-lingual models for each language that significantly outperform the
  multi-lingual model. See [docs](https://hanlp.hankcs.com/docs/api/hanlp/pretrained/mtl.html) for the list of models.

## Train Your Own Models

Writing deep learning models is not hard; the hard part is writing a model that reproduces the scores in papers. The
snippet below shows how to surpass the state-of-the-art tokenizer in 6 minutes.

```python
tokenizer = TransformerTaggingTokenizer()
tokenizer.fit(
    ...  # training corpus and hyperparameters are collapsed in this diff view
)
tokenizer.evaluate(SIGHAN2005_PKU_TEST, save_dir)
```
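The `96.70` above is a span-level F1 over predicted word boundaries. For intuition, here is a self-contained sketch of that metric (a standard span-F1 computation, not HanLP's internal scorer):

```python
def spans(words):
    """Convert a segmented sentence into (start, end) character spans."""
    out, start = [], 0
    for w in words:
        out.append((start, start + len(w)))
        start += len(w)
    return out

def f1(gold, pred):
    """Span F1 between a gold and a predicted segmentation of the same text."""
    g, p = set(spans(gold)), set(spans(pred))
    tp = len(g & p)                      # spans with identical boundaries
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

gold = ["商品", "和", "服务"]
pred = ["商品", "和服", "务"]
print(round(f1(gold, pred), 4))  # 0.3333
```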

The result is guaranteed to be `96.70` as the random seed is fixed. Unlike some overclaiming papers and projects, HanLP
promises that every single digit in our scores is reproducible. Any reproducibility issue will be treated and solved as
a top-priority fatal bug.

## Performance

The performance of multi-task learning models is shown in the following table.

<table><thead><tr><th rowspan="2">lang</th><th rowspan="2">corpora</th><th rowspan="2">model</th><th colspan="2">tok</th><th colspan="4">pos</th><th colspan="3">ner</th><th rowspan="2">dep</th><th rowspan="2">con</th><th rowspan="2">srl</th><th colspan="4">sdp</th><th rowspan="2">lem</th><th rowspan="2">fea</th><th rowspan="2">amr</th></tr><tr><th>fine</th><th>coarse</th><th>ctb</th><th>pku</th><th>863</th><th>ud</th><th>pku</th><th>msra</th><th>ontonotes</th><th>SemEval16</th><th>DM</th><th>PAS</th><th>PSD</th></tr></thead><tbody><tr><td rowspan="2">mul</td><td rowspan="2">UD2.7<br>OntoNotes5</td><td>small</td><td>98.62</td><td>-</td><td>-</td><td>-</td><td>-</td><td>93.23</td><td>-</td><td>-</td><td>74.42</td><td>79.10</td><td>76.85</td><td>70.63</td><td>-</td><td>91.19</td><td>93.67</td><td>85.34</td><td>87.71</td><td>84.51</td><td>-</td></tr><tr><td>base</td><td>98.97</td><td>-</td><td>-</td><td>-</td><td>-</td><td>90.32</td><td>-</td><td>-</td><td>80.32</td><td>78.74</td><td>71.23</td><td>73.63</td><td>-</td><td>92.60</td><td>96.04</td><td>81.19</td><td>85.08</td><td>82.13</td><td>-</td></tr><tr><td rowspan="5">zh</td><td rowspan="2">open</td><td>small</td><td>97.25</td><td>-</td><td>96.66</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>95.00</td><td>84.57</td><td>87.62</td><td>73.40</td><td>84.57</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>base</td><td>97.50</td><td>-</td><td>97.07</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>96.04</td><td>87.11</td><td>89.84</td><td>77.78</td><td>87.11</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td 
rowspan="3">close</td><td>small</td><td>96.70</td><td>95.93</td><td>96.87</td><td>97.56</td><td>95.05</td><td>-</td><td>96.22</td><td>95.74</td><td>76.79</td><td>84.44</td><td>88.13</td><td>75.81</td><td>74.28</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>base</td><td>97.52</td><td>96.44</td><td>96.99</td><td>97.59</td><td>95.29</td><td>-</td><td>96.48</td><td>95.72</td><td>77.77</td><td>85.29</td><td>88.57</td><td>76.52</td><td>73.76</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr><tr><td>ernie</td><td>96.95</td><td>97.29</td><td>96.76</td><td>97.64</td><td>95.22</td><td>-</td><td>97.31</td><td>96.47</td><td>77.95</td><td>85.67</td><td>89.17</td><td>78.51</td><td>74.10</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td><td>-</td></tr></tbody></table>

- Multi-task learning models often under-perform their single-task counterparts according to our latest research.
  Similarly, mono-lingual models often outperform multi-lingual models. Therefore, we strongly recommend using
  [a single-task mono-lingual model](https://hanlp.hankcs.com/docs/api/hanlp/pretrained/index.html) if you are
  targeting high accuracy rather than speed.
- A state-of-the-art [AMR model](https://hanlp.hankcs.com/docs/api/hanlp/pretrained/amr.html) has been released.

## Citing

If you use HanLP in your research, please cite this repository.

```bibtex
@inproceedings{he-choi-2021-stem,
    title = "The Stem Cell Hypothesis: Dilemma behind Multi-Task Learning with Transformer Encoders",
    author = "He, Han and Choi, Jinho D.",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    publisher = "Association for Computational Linguistics",
    year = "2021",
}
```

### Codes

HanLP is licensed under **Apache License 2.0**. You can use HanLP in your commercial products for free. We would
appreciate it if you add a link to HanLP on your website.

### Models

Unless otherwise specified, all models in HanLP are licensed
under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

## References

41 changes: 41 additions & 0 deletions docs/api/hanlp/pretrained/pos.md
---
jupytext:
formats: ipynb,md:myst
text_representation:
extension: .md
format_name: myst
format_version: '0.8'
jupytext_version: 1.4.2
kernelspec:
display_name: Python 3
language: python
name: python3
---

# pos

The process of classifying words into their **parts of speech** and labeling them accordingly is known as **part-of-speech tagging**, **POS-tagging**, or simply **tagging**.

To tag a tokenized sentence:

````{margin} Batching is Faster
```{hint}
Tag multiple sentences at once for faster speed!
```
````


```{code-cell} ipython3
:tags: [output_scroll]
import hanlp
pos = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL)
pos(['我', '的', '希望', '是', '希望', '世界', '和平'])
```
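The margin hint applies because every call pays fixed overhead that batching amortizes. The tagger accepts either one tokenized sentence or a list of them; here is a toy dictionary-lookup stand-in illustrating that calling convention (not a real tagger, and the tags are illustrative assumptions):

```python
# Toy tag dictionary (illustrative CTB-style tags, not model output).
TOY_TAGS = {"我": "PN", "的": "DEG", "希望": "NN", "是": "VC", "世界": "NN", "和平": "NN"}

def toy_pos(inputs):
    """Mimic the tagger's convention: one tokenized sentence, or a batch of them."""
    if inputs and isinstance(inputs[0], str):    # single tokenized sentence
        return [TOY_TAGS.get(tok, "NN") for tok in inputs]
    return [toy_pos(sent) for sent in inputs]    # batch: list of sentences

print(toy_pos(["我", "的", "希望"]))             # ['PN', 'DEG', 'NN']
print(toy_pos([["我", "的"], ["世界", "和平"]]))  # [['PN', 'DEG'], ['NN', 'NN']]
```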

````{margin} Custom Dictionary Supported
```{seealso}
See [this tutorial](https://github.com/hankcs/HanLP/blob/master/plugins/hanlp_demo/hanlp_demo/zh/demo_pos_dict.py) for custom dictionary.
```
````

All the pre-trained taggers and their details are listed below.

```{eval-rst}
.. automodule:: hanlp.pretrained.pos
    :members:
```
45 changes: 44 additions & 1 deletion docs/api/hanlp/pretrained/tok.md
---
jupytext:
formats: ipynb,md:myst
text_representation:
extension: .md
format_name: myst
format_version: '0.8'
jupytext_version: 1.4.2
kernelspec:
display_name: Python 3
language: python
name: python3
---

# tok

Tokenization is a way of separating a sentence into smaller units called tokens. In lexical analysis, tokens usually refer to words.

To tokenize raw sentences:

````{margin} Batching is Faster
```{hint}
Tokenize multiple sentences at once for faster speed!
```
````


```{code-cell} ipython3
:tags: [output_scroll]
import hanlp
tok = hanlp.load(hanlp.pretrained.tok.COARSE_ELECTRA_SMALL_ZH)
tok(['商品和服务。', '阿婆主来到北京立方庭参观自然语义科技公司。'])
```

All the pre-trained tokenizers and their details are listed below.


````{margin} Custom Dictionary Supported
```{seealso}
See [this tutorial](https://github.com/hankcs/HanLP/blob/master/plugins/hanlp_demo/hanlp_demo/zh/demo_custom_dict.py) for custom dictionary.
```
````
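For intuition about what a custom dictionary can do, here is a self-contained sketch of one possible post-processing strategy: greedily merging adjacent model tokens whose concatenation appears in the dictionary (an illustration only, not HanLP's actual implementation):

```python
def merge_by_dict(tokens, phrases):
    """Greedily merge adjacent tokens whose concatenation is a known phrase."""
    out, i = [], 0
    while i < len(tokens):
        merged = False
        # Try the longest possible merge first.
        for j in range(len(tokens), i + 1, -1):
            if "".join(tokens[i:j]) in phrases:
                out.append("".join(tokens[i:j]))
                i, merged = j, True
                break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out

print(merge_by_dict(["商品", "和", "服务", "。"], {"和服务"}))
# ['商品', '和服务', '。']
```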

```{eval-rst}
.. automodule:: hanlp.pretrained.tok
    :members:
```

4 changes: 1 addition & 3 deletions docs/configure.md
If you need fine grained control over each component, use ``hanlp.load(..., devices=...)``.
See documents for :meth:`hanlp.load`.
```

### External Resources

For deep learning beginners, you might need to learn how to set up a working GPU environment first. Here are some
resources.

- In fact, you can click [![Open In Colab](https://file.hankcs.com/img/colab-badge.svg)](https://colab.research.google.com/drive/1KPX6t1y36TOzRIeB4Kt3uJ1twuj6WuFv?usp=sharing) to play with the GPU-enabled HanLP tutorial right now.

## Use Mirror Sites

By default, models are downloaded from a global CDN we maintain. However, in some regions the downloading speed can
13 changes: 7 additions & 6 deletions plugins/hanlp_demo/hanlp_demo/zh/demo_pipeline.py

# A pipeline chains multiple callables, no matter whether they are rules, TensorFlow components or PyTorch
# ones. However, it's slower than the MTL framework.
# pos = hanlp.load(hanlp.pretrained.pos.CTB9_POS_ALBERT_BASE)  # In case both tf and torch are used, load tf first.

HanLP = hanlp.pipeline() \
    .append(hanlp.utils.rules.split_sentence, output_key='sentences') \
    .append(hanlp.load('COARSE_ELECTRA_SMALL_ZH'), output_key='tok') \
    .append(hanlp.load('CTB9_POS_ELECTRA_SMALL'), output_key='pos') \
    .append(hanlp.load('MSRA_NER_ELECTRA_SMALL_ZH'), output_key='ner', input_key='tok') \
    .append(hanlp.load('CTB9_ELECTRA_SMALL'), output_key='con', input_key='tok')

doc = HanLP('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。阿婆主来到北京立方庭参观自然语义科技公司。')
print(doc)
doc.pretty_print()
