Update documentation
thatbudakguy committed Jan 15, 2024
1 parent f42fd40 commit 106be06
Showing 3 changed files with 35 additions and 8 deletions.
11 changes: 8 additions & 3 deletions core_inception/README.md
@@ -4,6 +4,12 @@

This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.

To get started, clone this project using Weasel, passing the template name (`core_inception`) followed by the destination directory:
`spacy project clone core_inception my_project_name --repo https://github.com/New-Languages-for-NLP/project-templates.git`

Then, follow the instructions in the README in the assets directory to set up your project's assets.


## 📋 project.yml

The [`project.yml`](project.yml) defines the data assets required by the
@@ -20,7 +26,6 @@ Commands are only re-run if their inputs have changed.
| --- | --- |
| `install-dependencies` | Install Python dependencies |
| `install-language` | Install the language module from Cadet |
| `validate-annotations` | Validate the files exported from INCEpTION |
| `convert-raw-text` | Convert raw text files to spaCy's format |
| `convert-annotations` | Convert annotated data from INCEpTION to spaCy's format |
| `split-data` | Split the data into training, validation, and test sets |
@@ -40,9 +45,9 @@ inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `install-dependencies` → `install-language` → `validate-annotations` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` → `debug-config` → `pretrain-model` → `train-model` |
| `all` | `install-dependencies` → `install-language` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` → `debug-config` → `pretrain-model` → `train-model` |
| `install` | `install-dependencies` → `install-language` |
| `setup` | `validate-annotations` → `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` |
| `setup` | `convert-raw-text` → `convert-annotations` → `split-data` → `debug-data` |
| `train` | `debug-config` → `pretrain-model` → `train-model` |

### 🗂 Assets
24 changes: 20 additions & 4 deletions core_inception/assets/README.md
@@ -1,20 +1,33 @@
# Project Assets

## Annotations (`/annotations`)

This directory contains annotated data exported from INCEpTION.

Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension, by convention.
Data for most linguistic layers is stored in the [CoNLL-U format](https://universaldependencies.org/format.html), with one token per line and blank lines separating sentences. Each annotated text is stored in a single file with the `.conllu` extension.
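
As a rough illustration of the layout described above, the following sketch splits CoNLL-U text into sentences. It assumes tab-separated columns and `#` comment lines, as in the CoNLL-U specification; the function name `read_conllu` and the sample data are hypothetical and are not part of this project.

```python
def read_conllu(text):
    """Split CoNLL-U text into sentences; each sentence is a list of column lists."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        current.append(line.split("\t"))
    if current:
        # The final sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences


# Hypothetical sample: two sentences with the ten standard CoNLL-U columns.
SAMPLE = "\n".join([
    "# text = Hello world",
    "\t".join(["1", "Hello", "hello", "INTJ", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", "world", "world", "NOUN", "_", "_", "1", "vocative", "_", "_"]),
    "",
    "# text = Bye",
    "\t".join(["1", "Bye", "bye", "INTJ", "_", "_", "0", "root", "_", "_"]),
])

sentences = read_conllu(SAMPLE)
print(len(sentences))        # number of sentences found
print(sentences[0][1][1])    # FORM column of the second token in the first sentence
```

This sketch keeps multiword-token and empty-node lines as ordinary rows; a real pipeline would more likely rely on spaCy's own converter.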

If you have named entity annotations and wish to combine them with your syntactic annotations, you can additionally export each file in the [CoNLL-2002 NER format](https://www.clips.uantwerpen.be/conll2002/ner/). These files will have the `.conll` extension. After export, you can use the `merge_annotations` script to add the NER annotations to the `MISC` column of your CoNLL-U files, for example:

Because CoNLL-U does not support named entity annotation without a custom extension, named entity annotations are stored in the simpler [IOB format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Each annotated text is stored in a single file with the `.conll` extension, similar to [CoNLL-2002 data](https://www.cnts.ua.ac.be/conll2002/ner/).
```bash
python scripts/merge_annotations.py annotations/my_text.conllu annotations/my_text.conll > merged.conllu
```

For more information, try:

When the data is converted into spaCy's binary format, any `.conllu` and `.conll` files with the same base name will be joined together into a single collection of documents. For example, `my_text.conllu` and `my_text.conll` will be joined together into a single collection of documents named `my_text`. **If the filenames differ, the data will be treated as separate documents, which will impact your model's accuracy.**
```bash
python scripts/merge_annotations.py --help
```
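
The `merge_annotations` script itself is not shown in this commit. To give a sense of what adding NER tags to the `MISC` column could look like, here is a hypothetical sketch; the function name, the `NE=` key, and the sample rows are all assumptions and may differ from what the real script does.

```python
def merge_ner_into_misc(conllu_rows, ner_tags, key="NE"):
    """Return copies of CoNLL-U token rows with an IOB tag appended to MISC.

    conllu_rows: list of 10-column token rows (lists of strings)
    ner_tags:    list of IOB tags, one per token, in the same order
    key:         hypothetical MISC attribute name used to store the tag
    """
    if len(conllu_rows) != len(ner_tags):
        raise ValueError("token count differs between the two files")
    merged = []
    for row, tag in zip(conllu_rows, ner_tags):
        row = list(row)  # copy so the input is left untouched
        entry = f"{key}={tag}"
        # MISC is the tenth column; "_" means it is empty.
        row[9] = entry if row[9] == "_" else f"{row[9]}|{entry}"
        merged.append(row)
    return merged


# Hypothetical sample data for one two-token sentence.
rows = [
    ["1", "Louis", "Louis", "PROPN", "_", "_", "0", "root", "_", "_"],
    ["2", "!", "!", "PUNCT", "_", "_", "1", "punct", "_", "SpaceAfter=No"],
]
merged_rows = merge_ner_into_misc(rows, ["B-PER", "O"])
for row in merged_rows:
    print("\t".join(row))
```

Raising an error on a token-count mismatch is a deliberate choice in this sketch: silently misaligned tags would be harder to notice than a failed merge.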

The included examples are annotated data from Project Gutenberg; see the section on the [text](../text) directory below for more information. This example data was annotated automatically and is not intended to be used for training a real model.

## Language Module (`/lang`)

This directory contains the language module exported from Cadet.

The language module needs to be installable via `pip`, so it must include (at a minimum) a `setup.py` file and an `__init__.py` file. The `setup.py` file uses spaCy's entry points to register the language with spaCy.
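
The sketch below shows one way the arguments for such a `setup.py` might be assembled. The `spacy_languages` entry point group is the one spaCy documents for custom languages; the package name, version, class name `Zxx`, and the helper `build_setup_kwargs` are hypothetical placeholders, and the module exported from Cadet may be organised differently.

```python
def build_setup_kwargs(lang_code, class_name):
    """Build keyword arguments for setuptools.setup() for a language module.

    lang_code:  language code, also assumed to be the package directory name
    class_name: hypothetical Language subclass defined in <lang_code>/__init__.py
    """
    return {
        "name": f"spacy-lang-{lang_code}",  # hypothetical distribution name
        "version": "0.1.0",                 # placeholder version
        "packages": [lang_code],
        "entry_points": {
            # spaCy looks up custom languages in this entry point group.
            "spacy_languages": [f"{lang_code} = {lang_code}:{class_name}"],
        },
    }


kwargs = build_setup_kwargs("zxx", "Zxx")
print(kwargs["entry_points"]["spacy_languages"])

# In an actual setup.py, these arguments would be passed on like this:
#     from setuptools import setup
#     setup(**build_setup_kwargs("zxx", "Zxx"))
```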

The module should have a directory structure like this:

```
lang
├── zxx
@@ -25,19 +38,22 @@ lang
```

**Replace the contents of this directory with your own language module**, renaming the directories labeled `zxx` to your [ISO-639 language code](https://www.loc.gov/standards/iso639-2/php/code_list.php). Then:

- change the value of the `lang` variable in `project.yml` to your language code
- change the value of `[nlp.lang]` in `configs/config.cfg` to your language code
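
A quick way to check the second of these changes is to read the `lang` key from the `[nlp]` section of the config. The sketch below uses only the standard library and a hypothetical inline sample; it assumes a simple, single-line value, and spaCy's own config loader would be the more robust choice for a real `config.cfg`.

```python
import configparser


def read_nlp_lang(cfg_text):
    """Return the language code stored under [nlp] lang in a config string."""
    # Interpolation is disabled so that spaCy-style ${...} references are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cfg_text)
    # Values are written as quoted strings, e.g. lang = "zxx".
    return parser.get("nlp", "lang").strip('"')


# Hypothetical excerpt of a config file, for illustration only.
SAMPLE_CFG = """
[paths]
train = null

[nlp]
lang = "zxx"
pipeline = ["tagger","parser","ner"]
"""

expected = "zxx"  # should match the `lang` variable in project.yml
found = read_nlp_lang(SAMPLE_CFG)
print(found == expected)
```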

When you run `spacy project run install-language`, spaCy will install your language module as a Python package, and register it with spaCy.

## Raw Text (`/text`)

This directory contains two example texts from Project Gutenberg:

- _A Muramasa blade: A story of feudalism in old Japan_ by Louis Wertheimber (1887) - [muramasa.txt](muramasa.txt)
- _The Vanguard of Venus_ by Landell Bartlett (1944) - [vanguard.txt](vanguard.txt)

For the license governing the use of these texts, see [LICENSE](LICENSE).

You can use plain text (`.txt`) files like these to pretrain your language model.

**Replace these texts with ones from your target language.**

8 changes: 7 additions & 1 deletion core_inception/project.yml
@@ -1,5 +1,11 @@
title: "Train new language core model with Cadet and INCEpTION"
description: "This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy."
description: |
  This project template lets you train a part-of-speech tagger, dependency parser, and named entity recognizer for a new language from your Cadet and INCEpTION data. It includes configuration for pretraining your model on raw text to improve its accuracy.
  To get started, clone this project using Weasel, passing the template name (`core_inception`) followed by the destination directory:
  `spacy project clone core_inception my_project_name --repo https://github.com/New-Languages-for-NLP/project-templates.git`
  Then, follow the instructions in the README in the assets directory to set up your project's assets.
# Variables can be referenced across the project.yml using ${vars.var_name}
vars: