Update documentation and PM preprocessing libraries.
khituras committed Mar 1, 2024
1 parent 1b1b681 commit c964440
Showing 15 changed files with 232 additions and 12 deletions.
24 changes: 22 additions & 2 deletions gepi/gepi-preprocessing/README.md
@@ -3,14 +3,16 @@
The NLP pipelines in this directory perform the linguistic processing and, eventually, the extraction of interactions between genes, gene products, gene groups and gene families. Since PubMed Central documents are much larger and more complex than PubMed abstracts, there are a few small differences in the processing pipelines - e.g. batch sizes, max sentence lengths etc. - which is why there are two very similar pipelines located in the respective subdirectories, `pmc` and `pubmed`. The largest difference is the reader component that parses the respective XML formats. Apart from this, the pipelines are very similar and perform the same processing steps.

## UIMA pipeline basics
This information is not required to run the pipelines. However, it helps to understand the contents of the pipeline directories and the NLP processing design.

Both pipelines are [UIMA](https://uima.apache.org/) pipelines. This means that the pipelines are composed of a series of processing components, e.g. sentence boundary detection, tokenization, abbreviation resolution, gene tagging and interaction extraction. These components are called **annotators** in UIMA. Each annotator has an **XML descriptor** stored in the `desc` and `descAll` subdirectories. The descriptor holds an annotator's metadata such as its name, description, input and output annotation types, parameters and more. Each annotator also consists of Java code that realizes the processing logic (the annotator code is not contained in this repository; see [Preparing the pipelines to run](#preparing-the-pipelines-to-run) for more information). For example, the tokenizer component expects that it is preceded by a sentence splitter in the pipeline. It reads the sentences, applies an ML model to split each sentence into tokens and returns the tokens. The descriptor specifies where the model is found and whether to actually expect sentences or to just tokenize the whole document text.
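
As an illustration of what such an annotator looks like in code, here is a minimal sketch. It assumes the JCoRe `Sentence` and `Token` types and replaces the ML model with a naive whitespace split; the real JCoRe components are configured through their XML descriptors and are much more elaborate.

```java
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import de.julielab.jcore.types.Sentence;
import de.julielab.jcore.types.Token;

/**
 * Illustrative annotator: reads Sentence annotations from the CAS and adds
 * Token annotations. A real JCoRe tokenizer would apply an ML model here.
 */
public class WhitespaceTokenizerExample extends JCasAnnotator_ImplBase {
    @Override
    public void process(JCas jCas) throws AnalysisEngineProcessException {
        for (Sentence sentence : JCasUtil.select(jCas, Sentence.class)) {
            String text = sentence.getCoveredText();
            int offset = sentence.getBegin();
            // naive whitespace split as a stand-in for the ML model
            int start = 0;
            for (int i = 0; i <= text.length(); i++) {
                if (i == text.length() || Character.isWhitespace(text.charAt(i))) {
                    if (i > start) {
                        Token token = new Token(jCas, offset + start, offset + i);
                        token.addToIndexes();
                    }
                    start = i + 1;
                }
            }
        }
    }
}
```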

UIMA uses a sophisticated type system formalism to specify annotation types like sentences and tokens. In their simplest form, annotation types are just named types - e.g. `Sentence` - whose instances span characters in the document text. However, types can have attributes called `features` that express properties of type instances. Types can also extend other types, which allows for taxonomic type systems. For example, the JulieLab type `de.julielab.jcore.types.Gene` extends `de.julielab.jcore.types.BioEntityMention` and inherits a feature named `species` which may hold the name of the species that the mentioned entity belongs to.
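
A small sketch of how such types are accessed from Java; the getter for the `species` feature is assumed to follow the standard JCas naming convention, and its exact return type is not asserted here.

```java
import org.apache.uima.fit.util.JCasUtil;
import org.apache.uima.jcas.JCas;
import de.julielab.jcore.types.BioEntityMention;
import de.julielab.jcore.types.Gene;

public class TypeSystemExample {
    /**
     * Prints all gene mentions in the CAS. Because Gene extends
     * BioEntityMention, selecting the supertype also returns Gene
     * instances; the inherited 'species' feature is read via its
     * generated JCas getter (name assumed from the convention).
     */
    public static void printGenes(JCas jCas) {
        for (Gene gene : JCasUtil.select(jCas, Gene.class)) {
            System.out.println(gene.getCoveredText() + " [species: " + gene.getSpecies() + "]");
        }
        // Selecting the supertype also yields the Gene annotations:
        int allEntities = JCasUtil.select(jCas, BioEntityMention.class).size();
        System.out.println("BioEntityMention instances (incl. genes): " + allEntities);
    }
}
```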

The document text and its annotations are held in a Common Analysis System (CAS) object. This is basically a container for the text, its annotations, annotation indices for efficient access to the annotations, and more functionality. Each annotator receives the CAS of a document, reads its required input and writes its output into the CAS. Thus, the CAS is the UIMA object that is passed through the pipeline components and always holds the current processing state until it is consumed into some external output at the end of the pipeline.

**Aggregate analysis engines (AAE)** are UIMA structures that contain a list of other analysis engines/annotators and present themselves as a single annotator. This is useful to group annotators that are functionally dependent on each other or simply to bundle components together.
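
With uimaFIT, an aggregate can be assembled programmatically as in the following sketch; the `NoOpAnnotator` instances are placeholders for real annotators, and in the GePI pipelines the AAEs are actually defined by the XML descriptors.

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.fit.component.NoOpAnnotator;
import org.apache.uima.fit.factory.AggregateBuilder;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.resource.ResourceInitializationException;

public class AaeExample {
    public static AnalysisEngineDescription buildAae() throws ResourceInitializationException {
        AggregateBuilder builder = new AggregateBuilder();
        // NoOpAnnotator stands in for real annotators such as the sentence
        // splitter and the tokenizer; the aggregate presents itself to the
        // pipeline as a single annotator.
        builder.add(AnalysisEngineFactory.createEngineDescription(NoOpAnnotator.class));
        builder.add(AnalysisEngineFactory.createEngineDescription(NoOpAnnotator.class));
        return builder.createAggregateDescription();
    }
}
```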

## JCoRe basics
The annotator components used for the pipelines come from [JCoRe](https://github.com/JULIELab/jcore-base), the JULIE Lab Component Repository. All components in this repository are UIMA components tailored for our own type system. To view, edit and use the GePI pipelines, no deeper knowledge of JCoRe is required. By using the JCoRe Pipeline Builder ([see below](#viewing-and-editing-the-pipelines)), direct access to the JCoRe components is established.

@@ -48,11 +50,29 @@ Importing and preprocessing of the data happens in the following steps:
6. Successfully running a pipeline will create a table `_data_xmi.documents` in the respective PostgreSQL database. This table contains the documents in the UIMA XMI serialization format as well as their annotations. The indexing pipeline reads from this table.

## Details and tips for running the pipelines

### Scaling the pipelines
- The pipelines scale vertically by specifying the number of threads in the `run.xml` file. The pipelines are thread-safe and have been run with 10 to 24 threads.
- The pipelines scale horizontally by running the very same pipeline on multiple machines that all have access to the PostgreSQL DBMS. This is where the subset tables from above come into play:
- The data import creates the table `_data._data`, i.e. the `_data` table in the schema `_data`. The XML contents are stored in this table, one document per row.
- While it is possible to read from this so-called *data table* directly, the use of a subset table is recommended due to some quality-of-life features that come along with it. **For horizontal scale-out, a subset table is required**.
- A subset table is a CoStoSys concept and serves two main purposes:
1. Represent a subset of a data table, i.e. a corpus, by referencing document IDs in the data table via foreign keys; the XML data is not duplicated.
2. Subset tables have a specific column schema that contains information about the state of processing for each document, i.e. the boolean columns `in_process` and `is_processed`. This allows querying the processing state with CoStoSys, e.g. `java -jar costosys.jar -st _data._data_mirror` where `-st` stands for *status* (a direct JDBC variant of such a status query is sketched below this list). Call `java -jar costosys.jar` without parameters for the full list of functions. See the [CoStoSys repository](https://github.com/JULIELab/costosys) for details on the concepts.
3. Through the processing status, subset tables **synchronize multiple pipelines** accessing the same CoStoSys database. Each pipeline only reads documents that have not yet been marked as being `in_process`.
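
Under the assumption that the subset table exposes the boolean `in_process` and `is_processed` columns mentioned above, the processing status can also be inspected directly against PostgreSQL. The JDBC sketch below is only illustrative (connection settings and the exact table layout depend on the CoStoSys configuration); the supported way is the `costosys.jar -st` call shown above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SubsetStatusExample {
    public static void main(String[] args) throws Exception {
        // Connection parameters are placeholders; the real values come from
        // the CoStoSys configuration used by the pipelines.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/gepi", "gepi", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT count(*) FILTER (WHERE is_processed) AS processed, " +
                 "       count(*) FILTER (WHERE in_process)   AS in_process, " +
                 "       count(*)                             AS total " +
                 "FROM _data._data_mirror")) {
            if (rs.next())
                System.out.printf("processed=%d, in_process=%d, total=%d%n",
                        rs.getLong("processed"), rs.getLong("in_process"), rs.getLong("total"));
        }
    }
}
```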

### Detection of text changes in updated documents
The pipelines use a mechanism to avoid re-processing of a document that has been updated in PubMed or PMC but whose text contents have not changed. For this purpose, the SHA-256 hash of the document text is saved in the `_data_xmi.documents` table where the preprocessing pipelines store their results. The hash can be used to determine whether a document currently in preprocessing, firstly, already exists in the database and, secondly, has changed text contents in comparison to the stored version. If *no change in the text contents* is detected, we can basically skip the preprocessing and save time; this happens, for example, when only metadata has changed. This helps to keep the interaction storage up to date with documents updated through MEDLINE or PMC update files. Those update files often contain documents that had been added before but, e.g. in MEDLINE, come with an updated status or some metadata addition, MeSH terms etc.
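
A minimal sketch of the underlying idea follows; the actual lookup logic lives in the multiplier component described below, and the hex encoding of the hash is an assumption here - only the use of SHA-256 over the document text comes from the description above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class TextChangeCheckExample {
    /** Hex-encoded SHA-256 over the document text (encoding is an assumption). */
    public static String sha256(String documentText) throws NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(documentText.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(hash);
    }

    /**
     * True if the incoming document may skip most of the pipeline, i.e. a
     * version already exists in the database and its text hash is unchanged.
     */
    public static boolean isUnchanged(String newText, String storedHash) throws NoSuchAlgorithmException {
        return storedHash != null && storedHash.equals(sha256(newText));
    }
}
```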

The detection of text changes happens in the `JCoRe GNormPlus PubMed Database Multiplier` component. A number of parameters influence this functionality:

| Parameter Name | Parameter Type | Mandatory | Multivalued | Description |
|----------------|----------------|-----------|-------------|-------------|
| SkipUnchangedDocuments | Boolean | false | false | Whether GNormPlus should skip documents that already exist in the database with unchanged text contents. |
| AddToVisitKeys | Boolean | false | false | Whether to add the value of the `ToVisitKeys` parameter to the CAS. See description of `ToVisitKeys` for more details. |
| AddUnchangedDocumentTextFlag | Boolean | false | false | Whether to set a flag in the CAS that indicates whether the document text has changed in comparison to a potentially existing document in the database with the same document ID. Used by downstream components like the XMI DB Writer to decide on actions depending on (un)changed text contents. |
| ToVisitKeys | String[] | false | true | The UIMA-aggregate keys of the pipeline components that should still be visited if the document text is unchanged in comparison to a potentially existing document in the database with the same ID. All other components will be skipped **if** a `JCoRe Annotation Defined Flow Controller` is present as the first annotator and/or first consumer in the JCoRe pipeline. The flow controller looks at the `ToVisitKeys` and routes the CAS accordingly, skipping all components whose key is not in the list. In CoStoSys pipelines we want to run the DB Checkpoint AE in any case, unchanged document text or not, because we want to keep track of whether documents have already been processed by the pipeline. |
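
In GePI these parameters are set in the multiplier's XML descriptor. As a purely illustrative sketch of an equivalent programmatic configuration: the multiplier class below is a placeholder that only declares the parameters from the table, not the actual JCoRe component, and the aggregate key passed to `ToVisitKeys` is made up.

```java
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.fit.component.JCasMultiplier_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.fit.factory.AnalysisEngineFactory;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

/**
 * Placeholder multiplier that merely declares the parameters from the table
 * above; it stands in for the JCoRe GNormPlus PubMed Database Multiplier,
 * whose actual implementation is not part of this sketch.
 */
public class PlaceholderGNormPlusMultiplier extends JCasMultiplier_ImplBase {
    @ConfigurationParameter(name = "SkipUnchangedDocuments", mandatory = false)
    private boolean skipUnchangedDocuments;
    @ConfigurationParameter(name = "AddToVisitKeys", mandatory = false)
    private boolean addToVisitKeys;
    @ConfigurationParameter(name = "AddUnchangedDocumentTextFlag", mandatory = false)
    private boolean addUnchangedDocumentTextFlag;
    @ConfigurationParameter(name = "ToVisitKeys", mandatory = false)
    private String[] toVisitKeys;

    @Override
    public void process(JCas jCas) { /* real logic lives in the JCoRe component */ }
    @Override
    public boolean hasNext() { return false; }
    @Override
    public AbstractCas next() { throw new IllegalStateException("placeholder only"); }

    /** Example configuration; "DB Checkpoint AE" is an illustrative aggregate key. */
    public static AnalysisEngineDescription exampleConfiguration() throws ResourceInitializationException {
        return AnalysisEngineFactory.createEngineDescription(PlaceholderGNormPlusMultiplier.class,
                "SkipUnchangedDocuments", true,
                "AddToVisitKeys", true,
                "AddUnchangedDocumentTextFlag", true,
                "ToVisitKeys", new String[] { "DB Checkpoint AE" });
    }
}
```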

In the GePI pipelines, these parameters are set to avoid the re-processing of documents that are already stored in the database with the same text contents, and there is no particular need to change them. But if you do make changes, take care to stay consistent: `SkipUnchangedDocuments` should only be enabled if all the other parameters are also set to values that skip all components that would need the GNormPlus annotations, like, for example, BioSem for interaction extraction.

Also note the importance - as mentioned in the description of the `ToVisitKeys` parameter above - of the `JCoRe Annotation Defined Flow Controller` component. This component evaluates the `ToVisitKeys` that have been set by the multiplier to route the CAS through its aggregate analysis engine (described in [UIMA pipeline basics](#uima-pipeline-basics) above; used in JCoRe pipelines to bundle all annotators into one AAE and all consumers into another AAE). Only the components mentioned in `ToVisitKeys` will run when the `Annotation Defined Flow Controller` is active.