Document Processing Pipeline for Historical Documents (TEI XML Output)

This project automates the conversion of scanned historical documents into a structured and searchable TEI XML format, facilitating analysis and archiving. It leverages the OCR-D processor suite for efficient handling of document images and incorporates a flexible approach to accommodate document variations.

Project Goals

  • Structured Output: Generate files in TEI XML, an established standard for encoding historical documents.
  • Document Versatility: Handle documents with varying characteristics, employing appropriate processing steps based on their specific needs.
  • Metadata Extraction: Extract relevant metadata like title, author, date, and page numbers.
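
As an illustration of the goals above, here is a minimal sketch of a TEI XML document that carries title, author, date, and page numbers. It is not the conversion code shipped in this repository: the function name `build_tei`, its parameters, and the choice to emit one `<p>` per page are assumptions made for the example, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def q(tag):
    """Return the tag name qualified with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def build_tei(title, author, date, pages):
    """Build a minimal TEI element tree (hypothetical helper).

    `pages` is a list of page texts; each page becomes a numbered
    <pb/> page break followed by a single <p> holding its text.
    """
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(q("TEI"))

    # Header: fileDesc with titleStmt, publicationStmt and sourceDesc.
    header = ET.SubElement(tei, q("teiHeader"))
    file_desc = ET.SubElement(header, q("fileDesc"))
    title_stmt = ET.SubElement(file_desc, q("titleStmt"))
    ET.SubElement(title_stmt, q("title")).text = title
    ET.SubElement(title_stmt, q("author")).text = author
    pub_stmt = ET.SubElement(file_desc, q("publicationStmt"))
    ET.SubElement(pub_stmt, q("p")).text = "Generated from scanned page images."
    source_desc = ET.SubElement(file_desc, q("sourceDesc"))
    bibl = ET.SubElement(source_desc, q("bibl"))
    ET.SubElement(bibl, q("title")).text = title
    ET.SubElement(bibl, q("author")).text = author
    ET.SubElement(bibl, q("date")).text = date

    # Body: page breaks carry the page numbers in their "n" attribute.
    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    for number, page_text in enumerate(pages, start=1):
        ET.SubElement(body, q("pb"), {"n": str(number)})
        ET.SubElement(body, q("p")).text = page_text
    return tei


if __name__ == "__main__":
    doc = build_tei("Sample Title", "Unknown", "1850", ["First page.", "Second page."])
    print(ET.tostring(doc, encoding="unicode"))
```

A real conversion would likely keep lines, paragraphs, and layout regions from the OCR output instead of flattening each page into one paragraph.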

Features

  • Processes scanned documents (potentially in PDF format).
  • Performs pre-processing tasks, such as splitting multi-page TIFF files into individual images.
  • Extracts text using Optical Character Recognition (OCR).
  • Generates TEI XML files representing the document structure and content.

Technologies

  • Optical Character Recognition (OCR) Engine
  • Python
  • TEI XML Libraries
  • Additional Libraries (Optional): Libraries for pre-processing (e.g., image manipulation) or post-processing (e.g., data normalization), depending on the configuration.

System Requirements and Setup

  • Additional libraries might be required based on the specific configuration.
  • For instructions tailored to your needs, refer to the OCR-D documentation, which details the processing steps available within the OCR-D framework.

Running the Project

Installation:
  1. Clone the Repository: git clone https://github.com/VaishaliBurge29/Converting_Scanned_Documents_To_TEI_XML.git
  2. Project Structure:
    • Data folder: This folder contains the scanned documents. Each document folder has its own images folder and processed files. You can organize the documents further using subfolders if needed.
    • Images folder (Optional): If your pipeline involves working with intermediate image files during processing, this folder might store those temporary images.
    • Output folder: This folder stores the final TEI XML output files generated by the pipeline.
  3. Prerequisites:
    • Operating System: The pipeline is primarily developed for Ubuntu or similar Linux distributions. On Windows, WSL may be an option.
    • Python: Ensure that Python is installed on your system.

Execution

This project does not require users to run a script directly. The provided Python code processes the documents within the data folders and writes the resulting TEI XML files to the output folder.
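
To make the folder-based workflow more concrete, the sketch below walks a data folder and maps each document folder to the TEI file that would be written for it. The function name `plan_outputs`, the folder name `output`, and the `<folder name>.xml` naming scheme are assumptions for illustration, not the repository's actual conventions.

```python
from pathlib import Path


def plan_outputs(data_dir, output_name="output"):
    """Map each document folder to a planned TEI output path.

    Hypothetical helper: assumes every subfolder of `data_dir` except
    the output folder is one document, and that its TEI file would be
    named after the folder.
    """
    data_dir = Path(data_dir)
    plan = {}
    for doc_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        if doc_dir.name == output_name:
            continue  # the output folder lives inside the data folder
        plan[doc_dir.name] = data_dir / output_name / f"{doc_dir.name}.xml"
    return plan


if __name__ == "__main__":
    # Prints one line per document folder found under ./data, if it exists.
    if Path("data").is_dir():
        for name, target in plan_outputs("data").items():
            print(f"{name} -> {target}")
```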

Contents of Files

  1. Data:
    • All 24 document folders are uploaded in the data folder.
    • Each document folder contains the pages as processed by the OCR-D processors.
    • Each document folder also contains the split images.
  2. Output:
    • The output folder is located inside the data folder and includes:
      • The combined hOCR file.
      • The TEI XML conversion Python code.
      • The TEI XML output file.
  3. Code files:
    • The lib folder contains Python code for splitting TIFF documents into individual documents.
    • It also contains the code for combining hOCR files, which is included in the main branch.
  4. Documentation:
    • The documentation folder contains the detailed project report.
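
The hOCR combining step mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the code in the lib folder: it assumes each per-page hOCR file is well-formed XHTML with the usual `ocr_page` elements, and the names `combine_hocr` and `is_page` are hypothetical. It uses only the Python standard library.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def is_page(elem):
    """True if the element is marked as an hOCR page."""
    return "ocr_page" in elem.get("class", "").split()


def combine_hocr(page_paths, out_path):
    """Merge per-page hOCR files into one file and return the page count.

    The first file supplies the <head> and <body>; the ocr_page elements
    of every later file are appended to that body in the order given.
    """
    ET.register_namespace("", XHTML_NS)
    combined = body = None
    count = 0
    for path in page_paths:
        root = ET.parse(str(path)).getroot()
        pages = [e for e in root.iter() if is_page(e)]
        if combined is None:
            combined = root
            body = root.find(f"{{{XHTML_NS}}}body")
            if body is None:
                raise ValueError(f"{path}: no XHTML <body> element found")
        else:
            body.extend(pages)
        # Renumber page ids so they stay unique in the combined file.
        for page in pages:
            count += 1
            page.set("id", f"page_{count}")
    if combined is None:
        raise ValueError("no hOCR files given")
    ET.ElementTree(combined).write(str(out_path), encoding="utf-8",
                                   xml_declaration=True)
    return count
```

Only the page-level ids are renumbered here. Ids on nested blocks, lines, and words may still repeat across pages, so a production version would need to handle those as well.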
