How to tag text for English
apmoore1 committed Jun 2, 2022
1 parent 1d1ffb8 commit 8166391
Showing 4 changed files with 86 additions and 3 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -10,6 +10,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- The documentation now has a `How-to` guide on `Tag CoNLL-U Files`.
- The documentation now has a `How-to Tag Text` guide for Finnish and English.

## [v0.3.0](https://github.com/UCREL/pymusas/releases/tag/v0.3.0) - 2022-05-04

3 changes: 2 additions & 1 deletion README.md
@@ -38,7 +38,7 @@

## Language support

PyMUSAS currently supports 9 different languages with pre-configured spaCy components that can be downloaded, and each language has its own [guide on how to tag text using PyMUSAS](https://ucrel.github.io/pymusas/usage/how_to/tag_text). Below we show the languages supported, whether the model for that language supports Multi Word Expression (MWE) identification and tagging (all languages support token level tagging by default), and the size of the model:
PyMUSAS currently supports 10 different languages with pre-configured spaCy components that can be downloaded, and each language has its own [guide on how to tag text using PyMUSAS](https://ucrel.github.io/pymusas/usage/how_to/tag_text). Below we show the languages supported, whether the model for that language supports Multi Word Expression (MWE) identification and tagging (all languages support token level tagging by default), and the size of the model:

| Language (BCP 47 language code) | MWE Support | Size |
| --- | --- | --- |
@@ -51,6 +51,7 @@ PyMUSAS currently support 9 different languages with pre-configured spaCy compon
| Italian (it) | :heavy_check_mark: | 0.50MB |
| Dutch, Flemish (nl) | :x: | 0.15MB |
| Portuguese (pt) | :heavy_check_mark: | 0.27MB |
| English (en) | :heavy_check_mark: | 0.88MB |

## Install PyMUSAS

3 changes: 2 additions & 1 deletion docs/docs/usage/getting_started/intro.md
@@ -8,7 +8,7 @@ sidebar_position: 1

**Py**thon **M**ultilingual **U**crel **S**emantic **A**nalysis **S**ystem, is a rule based token and Multi Word Expression (MWE) semantic tagger. The tagger can support any semantic tagset, however the tagset we have concentrated on and released pre-configured spaCy components for is the [Ucrel Semantic Analysis System (USAS)](https://ucrel.lancs.ac.uk/usas/).

PyMUSAS currently supports 9 different languages with pre-configured spaCy components that can be downloaded, and each language has its own [guide on how to tag text using PyMUSAS](/usage/how_to/tag_text). Below we show the languages supported, whether the model for that language supports MWE identification and tagging (all languages support token level tagging by default), and the size of the model:
PyMUSAS currently supports 10 different languages with pre-configured spaCy components that can be downloaded, and each language has its own [guide on how to tag text using PyMUSAS](/usage/how_to/tag_text). Below we show the languages supported, whether the model for that language supports MWE identification and tagging (all languages support token level tagging by default), and the size of the model:

| Language (BCP 47 language code) | MWE Support | Size |
| --- | --- | --- |
@@ -21,6 +21,7 @@ PyMUSAS currently support 9 different languages with pre-configured spaCy compon
| Italian (it) | :heavy_check_mark: | 0.50MB |
| Dutch, Flemish (nl) | :x: | 0.15MB |
| Portuguese (pt) | :heavy_check_mark: | 0.27MB |
| English (en) | :heavy_check_mark: | 0.88MB |

## Reading the documentation

82 changes: 81 additions & 1 deletion docs/docs/usage/how_to/tag_text.md
@@ -9,7 +9,7 @@ In this guide we are going to show you how to tag text using the PyMUSAS [RuleBa
2. Download and use a Natural Language Processing (NLP) pipeline that will tokenise, lemmatise, and Part Of Speech (POS) tag. In most cases this will be a spaCy pipeline. **Note** that the PyMUSAS `RuleBasedTagger` only requires at minimum the data to be tokenised but having the lemma and POS tag will improve the accuracy of the tagging of the text.
3. Run the PyMUSAS `RuleBasedTagger`.
4. Extract token level linguistic information from the tagged text, which will include USAS semantic tags.
5. For Chinese, Italian, Portuguese, Spanish, and Welsh taggers which support Multi Word Expression (MWE) identification and tagging we will show how to extract this information from the tagged text as well.
5. For Chinese, Italian, Portuguese, Spanish, Welsh, and English taggers which support Multi Word Expression (MWE) identification and tagging we will show how to extract this information from the tagged text as well.


## Chinese
@@ -1059,4 +1059,84 @@ bayar bayar VB ['Z99']

</details>

</details>

## English

<details>
<summary>Expand</summary>

First download both the [English PyMUSAS `RuleBasedTagger` spaCy component](https://github.com/UCREL/pymusas-models/releases/tag/en_dual_none_contextual-0.3.1) and the [small English spaCy model](https://spacy.io/models/en):

``` bash
pip install https://github.com/UCREL/pymusas-models/releases/download/en_dual_none_contextual-0.3.1/en_dual_none_contextual-0.3.1-py3-none-any.whl
python -m spacy download en_core_web_sm
```

Then create the tagger, in a Python script:

``` python
import spacy

# We exclude the following components as we do not need them.
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner'])
# Load the English PyMUSAS rule based tagger in a separate spaCy pipeline
english_tagger_pipeline = spacy.load('en_dual_none_contextual')
# Adds the English PyMUSAS rule based tagger to the main spaCy pipeline
nlp.add_pipe('pymusas_rule_based_tagger', source=english_tagger_pipeline)
```

The tagger is now set up for tagging text through the spaCy pipeline, as shown below (this example follows on from the last). The example text is taken from the English Wikipedia page on [`The Nile River`](https://en.wikipedia.org/wiki/Nile); we capitalised the *n* in `Northeastern`:

``` python
text = "The Nile is a major north-flowing river in Northeastern Africa."

output_doc = nlp(text)

print(f'Text\tLemma\tPOS\tUSAS Tags')
for token in output_doc:
    print(f'{token.text}\t{token.lemma_}\t{token.pos_}\t{token._.pymusas_tags}')
```

<details>

<summary>Output:</summary>

``` tsv
Text Lemma POS USAS Tags
The the DET ['Z5']
Nile Nile PROPN ['Z2']
is be AUX ['A3+', 'Z5']
a a DET ['Z5']
major major ADJ ['A11.1+', 'N3.2+']
north north NOUN ['M6']
- - PUNCT ['PUNCT']
flowing flow VERB ['M4', 'M1']
river river NOUN ['W3/M4', 'N5+']
in in ADP ['Z5']
Northeastern Northeastern PROPN ['Z1mf', 'Z3c']
Africa Africa PROPN ['Z1mf', 'Z3c']
. . PUNCT ['PUNCT']
```
</details>
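
Each token's `pymusas_tags` value is a list of tag strings, and some tags, such as `W3/M4`, combine more than one category with a slash. As a minimal sketch, assuming the list is ordered with the preferred tag first and that a slash joins categories assigned together, the helper below picks the first tag and splits it into its parts. `first_tag_parts` is a hypothetical name and not part of PyMUSAS; the tag lists are copied by hand from the output above, so the block runs without spaCy:

```python
from typing import Dict, List


def first_tag_parts(tags: List[str]) -> List[str]:
    """Return the components of the first tag in a token's tag list.

    Assumptions (not confirmed by the guide): the first tag is the
    preferred one, and '/' joins categories assigned together.
    """
    if not tags:
        return []
    return tags[0].split('/')


# Tag lists copied from the example output above.
example_tags: Dict[str, List[str]] = {
    'river': ['W3/M4', 'N5+'],
    'major': ['A11.1+', 'N3.2+'],
    'Nile': ['Z2'],
}

for word, tags in example_tags.items():
    print(f'{word}\t{first_tag_parts(tags)}')
```

In a real pipeline the same helper could be applied to `token._.pymusas_tags` for each token in `output_doc`.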

For English, the tagger also identifies and tags Multi Word Expressions (MWEs). To find these MWEs you can run the following:

``` python
print(f'Text\tPOS\tMWE start and end index\tUSAS Tags')

for token in output_doc:
    start, end = token._.pymusas_mwe_indexes[0]
    if (end - start) > 1:
        print(f'{token.text}\t{token.pos_}\t{(start, end)}\t{token._.pymusas_tags}')
```

This will output the following:

``` tsv
Text POS MWE start and end index USAS Tags
Northeastern PROPN (10, 12) ['Z1mf', 'Z3c']
Africa PROPN (10, 12) ['Z1mf', 'Z3c']
```
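
Tokens that belong to the same MWE share the same start and end index, so they can be grouped back into whole expressions. The sketch below assumes the indexes are token offsets with an exclusive end, which the output above suggests (tokens 10 and 11 both carry `(10, 12)`). `group_mwes` is a hypothetical helper, not part of PyMUSAS, and the rows are copied by hand from that output so the block runs without spaCy:

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]

# (token text, (MWE start, MWE end)) pairs copied from the output above.
rows: List[Tuple[str, Span]] = [
    ('Northeastern', (10, 12)),
    ('Africa', (10, 12)),
]


def group_mwes(rows: List[Tuple[str, Span]]) -> Dict[Span, str]:
    """Join the tokens that share an MWE span into one string per span."""
    grouped: Dict[Span, List[str]] = {}
    for text, span in rows:
        # Tokens arrive in document order, so appending keeps word order.
        grouped.setdefault(span, []).append(text)
    return {span: ' '.join(words) for span, words in grouped.items()}


print(group_mwes(rows))
```

With a live pipeline, the rows could be built from `token.text` and `token._.pymusas_mwe_indexes[0]` for the tokens whose span covers more than one token.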

</details>
