Validate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)
You can run the command line scripts and the web interface in a Docker container; you only need Docker installed.
To start the web interface on http://localhost:8080:
docker run --rm -it -p 8080:8080 ubma/ocr-fileformat
To run the command line scripts, mount the directory containing your input
files into the container's /data directory:
docker run --rm -it -v "$PWD":/data ubma/ocr-fileformat ocr-transform alto2.0 hocr somefile.alto
To install system-wide to /usr/local:
sudo make install
To install without sudo to your home directory:
make install PREFIX=$HOME/.local
If $HOME/.local/bin is not in your PATH, add this to your shell startup file (e.g. ~/.bashrc or ~/.zshrc):
export PATH="$HOME/.local/bin:$PATH"
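If the startup file is sourced more than once, a plain export keeps prepending the same directory. The following is a minimal sketch of a guard for POSIX-style shells; `path_prepend` is a name made up for this example and is not part of ocr-fileformat.

```shell
# Prepend a directory to PATH only if it is not already listed.
# path_prepend is a hypothetical helper name, not part of ocr-fileformat.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;     # otherwise put it in front
    esac
    export PATH
}

path_prepend "$HOME/.local/bin"
```

Entries in PATH are separated by colons, which is why the check wraps both PATH and the candidate directory in colons before matching.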
The web application has a PHP backend. You can deploy it on any PHP-capable
server by copying the web folder somewhere below the document root
of your server, e.g. /var/www/html for Apache on Debian/Ubuntu:
sudo -u www-data cp -r web /var/www/html/ocr-fileformat
In this example the GUI would be available under http://localhost/ocr-fileformat/.
The project offers two functions, which can be accessed via a command line script (CLI), through a web interface (GUI), or from your own tools (API):
- ocr-transform: Transformation of OCR output between OCR formats
- ocr-validate: Validation of OCR output against OCR format schemas
The web interface is for testing validation and transformations. You can upload a file or select an input file by URL.
- $PREFIX/share/ocr-fileformat/xslt: XSLT stylesheets
- $PREFIX/share/ocr-fileformat/xsd: XSD schemas
- $PREFIX/share/ocr-fileformat/script/transform: Transformation scripts
- $PREFIX/share/ocr-fileformat/script/validate: Validation scripts
Usage: ocr-transform [-dl] <input-fmt> <output-fmt> [<input> [<output>]] [-- <saxon_opts>]
For example, you can transform an ALTO XML file to an hOCR file with:
ocr-transform alto hocr sample.xml sample.hocr
Or convert from ALTO XML (version 2.1) to hOCR with:
ocr-transform alto2.1 hocr sample.alto sample.hocr
You can also pass arguments directly to the Saxon CLI by placing them after a double dash (--). For example, to set the foo parameter to bar:
ocr-transform alto hocr sample.xml sample.hocr -- foo=bar
Try ocr-transform -h to get an overview:
Usage:
ocr-transform [OPTIONS] <from> <to> [<infile> [<outfile>]] [-- <script-args>]
ocr-transform [OPTIONS] <from> <to> --help-args Show script-args, and exit
ocr-transform [OPTIONS] -h|--help               Show this help, and exit
ocr-transform [OPTIONS] -v|--version            Show version, and exit
ocr-transform [OPTIONS] -L|--list               List available from/to, and exit
    Options:
        --debug   -d     Increase debug level by 1, can be repeated
    Transformations:
        abbyy hocr
        abbyy page
        alto hocr
        alto page
        alto text
        alto2.0 alto3.0
        alto2.0 alto3.1
        alto2.0 hocr
        alto2.1 alto3.0
        alto2.1 alto3.1
        alto2.1 hocr
        alto4.2 alto2.1
        gcv alto
        gcv hocr
        gcv page
        hocr alto
        hocr alto2.0
        hocr alto2.1
        hocr alto3.0
        hocr alto4.0
        hocr page
        hocr tei
        hocr text
        mybib alto3.0
        page alto
        page alto_legacy
        page hocr
        page page2019
        page text
        tei hocr
        textract page
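Because ocr-transform handles one input file per call, converting a whole directory needs a loop. The sketch below converts every `*.alto` file in a directory to hOCR, using the `alto hocr` pair from the list above. The names `OCR_TRANSFORM` and `convert_dir` are made up for this example, and the `.alto` extension is an assumption about how your files are named.

```shell
# Convert every *.alto file in a directory to hOCR.
# OCR_TRANSFORM and convert_dir are hypothetical names for this sketch.
OCR_TRANSFORM="${OCR_TRANSFORM:-ocr-transform}"

convert_dir() {
    dir="$1"
    for infile in "$dir"/*.alto; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$infile" ] || continue
        outfile="${infile%.alto}.hocr"
        "$OCR_TRANSFORM" alto hocr "$infile" "$outfile" \
            || echo "failed: $infile" >&2
    done
}

# Usage: convert_dir /path/to/scans
```

Keeping the command name in a variable makes it easy to point the loop at a wrapper script instead, for example one that runs the tool inside the Docker container.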
Select the Transform menu option. Choose a URL, an input and an output
format. Click Transform.
The stylesheets are installed in $PREFIX/share/ocr-fileformat/xslt and can be
used directly in your scripts and software. You will need to use an XSLT 2.0
capable stylesheet transformer.
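As a rough illustration of direct use, the sketch below calls Saxon's command line interface with its `-s:`, `-xsl:` and `-o:` options. Several things here are assumptions: the location of the Saxon jar (`SAXON_JAR`), the stylesheet path you pass in, and the helper names `JAVA` and `run_stylesheet`. Check the xslt directory for the actual stylesheet file names before relying on it.

```shell
# Run one of the installed stylesheets with Saxon directly.
# Assumptions: SAXON_JAR points at a Saxon HE jar on your system, and the
# stylesheet path you pass in exists. JAVA and run_stylesheet are
# hypothetical names for this sketch.
JAVA="${JAVA:-java}"
SAXON_JAR="${SAXON_JAR:-saxon9he.jar}"

run_stylesheet() {
    xsl="$1"
    src="$2"
    out="$3"
    shift 3
    # Remaining arguments (e.g. foo=bar) are handed to Saxon as
    # stylesheet parameters.
    "$JAVA" -jar "$SAXON_JAR" -s:"$src" -xsl:"$xsl" -o:"$out" "$@"
}

# Usage: run_stylesheet path/to/stylesheet.xsl sample.xml sample.hocr foo=bar
```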
| From ╲ To | hOCR | ALTO | PAGEXML | TEI | Text | 
|---|---|---|---|---|---|
| hOCR | - | ✓ | ✓ | ✓ | ✓ | 
| ALTO | ✓ | ✓ | ✓ | - | ✓ | 
| PAGEXML | ✓ | ✓ | ✓ | - | ✓ | 
| ABBYY FineReader | ✓ | - | ✓ | - | - | 
| Google Cloud Vision | ✓ | ✓ | ✓ | - | - | 
| Amazon AWS Textract | - | - | ✓ | - | - | 
| TEI | ✓ | - | - | - | - | 
Usage:
ocr-validate [OPTIONS] <schema> <file> [<resultsFile>]
ocr-validate [OPTIONS] -h|--help       Show this help, and exit
ocr-validate [OPTIONS] -v|--version    Show version, and exit
ocr-validate [OPTIONS] -L|--list       List available schemas, and exit
    Options:
        --debug   -d     Increase debug level by 1, can be repeated
    Schemas:
        hocr
        alto-1-0 alto-1-1 alto-1-2 alto-1-3 alto-1-4 alto-2-0 alto-2-1 alto-2-2-draft alto-3-0 alto-3-1 alto-3-2-draft alto-4-0 alto-4-1 alto-4-2 alto-4-3
        abbyy-6-schema-v1 abbyy-8-schema-v2 abbyy-9-schema-v1 abbyy-10-schema-v1
        page-2009-03-16 page-2010-01-12 page-2010-03-19 page-2013-07-15 page-2016-07-15 page-2017-07-15 page-2018-07-15 page-2019-07-15
For example, to validate an XML file against the ALTO 3.1 schema:
ocr-validate alto-3-1 myFile.alto
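To check several files against the same schema, the call can be wrapped in a loop. This sketch assumes that ocr-validate exits with a non-zero status when a file does not validate; that behaviour is not stated above, so verify it for your version. `OCR_VALIDATE` and `validate_all` are names made up for this example.

```shell
# Validate several files against one schema and count the failures.
# Assumption: ocr-validate exits non-zero when a file does not validate.
# OCR_VALIDATE and validate_all are hypothetical names for this sketch.
OCR_VALIDATE="${OCR_VALIDATE:-ocr-validate}"

validate_all() {
    schema="$1"
    shift
    failures=0
    for file in "$@"; do
        if ! "$OCR_VALIDATE" "$schema" "$file" >/dev/null 2>&1; then
            echo "invalid: $file" >&2
            failures=$((failures + 1))
        fi
    done
    echo "$failures"          # number of files that failed
    [ "$failures" -eq 0 ]     # succeed only if every file validated
}

# Usage: validate_all alto-3-1 *.alto
```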
Select the Validate menu option. Choose a URL and a schema. Click Validate.
The XSD files are installed under $PREFIX/share/ocr-fileformat/xsd.
| | hOCR | ALTO | PAGEXML | FineReader | Google Cloud Vision | Amazon AWS Textract |
|---|---|---|---|---|---|---|
| Validation | ✓ | ✓ | ✓ | ✓ | - | - |
This is free software. You may use it under the terms of the MIT License.
During the installation process several projects are included (in ./vendor). These projects have different licenses:
- Saxon HE 9.7, MPL
- ALTOXML schema, "Open Source" for ALTO <= 3.1, CC BY SA 4.0 since ALTO 4.0
- PAGE schemas, ?
- xsd-validator by Adrian Mouat @amouat, Apache 2.0
- ABBYY FineReader XSD, ?
- hOCR-to-ALTO by Filip Kriz @filak, MIT
- hocr-spec by Konstantin Baierer @kba, MIT
- gcv2hocr by Endo Michiaki, CC BY 4.0
- format-converters by OCR-D, Apache 2.0
- prima-page-converter by PRImA Research Lab, Apache 2.0
- page-to-alto by Konstantin Baierer @kba, Apache 2.0
- textract2page by Arne Rümmler @rue-a, Apache 2.0
