GitHub - VaishaliBurge29/Converting_Scanned_Documents_To_TEI_XML

Document Processing Pipeline for Historical Documents (TEI XML Output)

This project automates the conversion of scanned historical documents into a structured and searchable TEI XML format, facilitating analysis and archiving. It leverages the OCR-D processor suite for efficient handling of document images and incorporates a flexible approach to accommodate document variations.

Project Goals

Structured Output: Generate TEI XML files, an established standard for encoding historical documents.
Document Versatility: Handle documents with varying characteristics, employing appropriate processing steps based on their specific needs.
Metadata Extraction: Extract relevant metadata like title, author, date, and page numbers.

Features

Processes scanned documents (potentially in PDF format).
Performs pre-processing tasks on the page images before text recognition.
Extracts text using Optical Character Recognition (OCR).
Generates TEI XML files representing the document structure and content.
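
As an illustration of the last feature, here is a minimal sketch of how a TEI XML file with basic metadata and page breaks could be assembled using only the Python standard library. The function name `build_tei`, its parameters, and the placeholder publication statement are assumptions made for this example and are not taken from the repository's code.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # official TEI P5 namespace


def build_tei(title, author, date, pages):
    """Build a small TEI document and return it as a string.

    `pages` is a list of pages; each page is a list of paragraph strings.
    `date` is assumed to be an ISO-style value such as "1850" or "1850-03-01".
    """
    ET.register_namespace("", TEI_NS)

    def q(tag):
        # Qualify a tag name with the TEI namespace.
        return f"{{{TEI_NS}}}{tag}"

    tei = ET.Element(q("TEI"))

    # Header: title, author, and date metadata.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    ET.SubElement(title_stmt, q("author")).text = author
    publication = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(publication, q("p")).text = "Unpublished transcription"  # placeholder
    source = ET.SubElement(file_desc, q("sourceDesc"))
    bibl = ET.SubElement(source, q("bibl"))
    ET.SubElement(bibl, q("date"), {"when": date}).text = date

    # Body: one <pb/> page break per page, followed by its paragraphs.
    text = ET.SubElement(tei, q("text"))
    body = ET.SubElement(text, q("body"))
    for number, paragraphs in enumerate(pages, start=1):
        ET.SubElement(body, q("pb"), {"n": str(number)})
        for paragraph in paragraphs:
            ET.SubElement(body, q("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")
```

Calling `build_tei("Sample letter", "Unknown", "1850", [["First page text."], ["Second page text."]])` returns a string that can be written to a `.xml` file. A real pipeline would likely need a richer header and structural markup than this sketch provides.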

Technologies

Optical Character Recognition (OCR) Engine
Python
TEI XML Libraries
Additional Libraries (Optional): Libraries for pre-processing (e.g., image manipulation) or post-processing (e.g., data normalization), depending on the configuration.

System Requirements and Setup

Operating System: Ubuntu or similar Linux distribution (consider WSL for Windows users: https://learn.microsoft.com/en-us/windows/wsl/)
Python: Python 3.x (Download from https://www.python.org/downloads/)
ocrd_all Package: Provides tools for document image processing and OCR (installation instructions: https://ocr-d.de/en/setup)

Additional libraries might be required depending on the specific configuration.
For instructions tailored to your needs, see the OCR-D documentation, which details the individual processing steps within the OCR-D framework.

Running the Project

Installation:

Clone the Repository: git clone https://github.com/VaishaliBurge29/Converting_Scanned_Documents_To_TEI_XML.git
Project Structure:
- Data folder: Contains the scanned documents. Each document has its own folder holding its images folder and processed files. You can organize the documents further using subfolders if needed.
- Images folder (Optional): Stores intermediate image files, if the pipeline produces any during processing.
- Output folder: This folder stores the final TEI XML output files generated by the pipeline.
Prerequisites:
- Operating System: The pipeline is primarily developed for Ubuntu or similar Linux distributions. Windows users can run it through WSL.
- Python: Ensure you have Python 3.x installed on your system.

Execution

There is no single script that users need to run directly. The provided Python code processes the documents in the data folders and writes the resulting TEI XML files to the output folder.

Contents of Files

Data:
- All 24 document folders are uploaded in the data folder.
- Each document folder holds the pages produced by the processing steps of the OCR-D processors.
- Each document folder also contains the split images.
Output:
- The output folder is located inside the data folder and includes:
- Combined hOCR file.
- TEI XML conversion python code.
- TEI XML output file.
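
One step that any hOCR-to-TEI conversion needs is reading the recognized text lines out of the hOCR markup. The sketch below shows one way to do this with the standard library's `html.parser`. It is not the repository's conversion code: the class and function names are hypothetical, and it assumes that text lines are marked with the hOCR class `ocr_line` (other line-level classes such as `ocr_caption` are ignored here for brevity).

```python
from html.parser import HTMLParser


class HocrLineExtractor(HTMLParser):
    """Collect the text of every element whose class list contains `ocr_line`."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._line_tag = None  # tag name of the current line element
        self._depth = 0        # nesting depth of that tag name (0 = outside a line)
        self._words = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Inside a line: only track tags with the same name as the line element,
            # so nested word spans are balanced and stray void tags are ignored.
            if tag == self._line_tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self._line_tag, self._depth, self._words = tag, 1, []

    def handle_endtag(self, tag):
        if self._depth and tag == self._line_tag:
            self._depth -= 1
            if self._depth == 0:
                # The line element itself has closed: store its words.
                self.lines.append(" ".join(self._words))

    def handle_data(self, data):
        if self._depth:
            self._words.extend(data.split())


def extract_lines(hocr_text):
    """Return the recognized text lines of an hOCR document, in document order."""
    parser = HocrLineExtractor()
    parser.feed(hocr_text)
    parser.close()
    return parser.lines
```

The returned list of strings could then be wrapped in TEI elements such as `<p>` or `<l>`, depending on how the document structure is to be encoded.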
Code files:
- The lib folder contains Python code for splitting TIFF documents into individual documents.
- It also contains the code for combining hOCR files, which is included in the main branch.
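
To illustrate what combining hOCR files can involve, the following sketch merges several per-page hOCR documents into one by keeping the first file (including its header metadata) and appending the body content of the remaining files. This is a simplified assumption about the approach, not the code in the lib folder. The function names and the `*.hocr` file pattern are hypothetical, and element ids that repeat across pages are not renumbered.

```python
import re
from pathlib import Path

# Matches the content between <body ...> and </body>.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body\s*>", re.DOTALL | re.IGNORECASE)
BODY_CLOSE_RE = re.compile(r"</body\s*>", re.IGNORECASE)


def combine_hocr(page_texts):
    """Merge hOCR documents (given as strings) into a single hOCR string."""
    if not page_texts:
        raise ValueError("no hOCR pages given")
    first, rest = page_texts[0], page_texts[1:]

    # Collect the body content of every page after the first.
    extra = []
    for text in rest:
        match = BODY_RE.search(text)
        if match is None:
            raise ValueError("hOCR page has no <body> element")
        extra.append(match.group(1).strip())

    # Insert the collected content just before the first file's </body>.
    close = BODY_CLOSE_RE.search(first)
    if close is None:
        raise ValueError("first hOCR page has no closing </body> tag")
    return first[:close.start()] + "\n" + "\n".join(extra) + "\n" + first[close.start():]


def combine_hocr_files(folder, output_path):
    """Combine all *.hocr files in `folder` (sorted by name) into `output_path`."""
    # Note: name sorting puts "page_10" before "page_2" unless names are zero-padded.
    paths = sorted(Path(folder).glob("*.hocr"))
    combined = combine_hocr([p.read_text(encoding="utf-8") for p in paths])
    Path(output_path).write_text(combined, encoding="utf-8")
    return len(paths)
```

Keeping the first file as the container preserves its `<head>` metadata, at the cost of assuming that all pages were produced with the same OCR settings.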
Documentation:
- The documentation folder contains our Detailed Report.
