This project automates the conversion of scanned historical documents into a structured and searchable TEI XML format, facilitating analysis and archiving. It leverages the OCR-D processor suite for efficient handling of document images and incorporates a flexible approach to accommodate document variations.
- Introduction
- Project Goals
- Features
- Technologies
- System Requirements and Setup
- Running the Project
- Execution
- Contents of Files
- Structured Output: Generate TEI XML files, an established standard for encoding historical documents.
- Document Versatility: Handle documents with varying characteristics, employing appropriate processing steps based on their specific needs.
- Metadata Extraction: Extract relevant metadata like title, author, date, and page numbers.
- Processes scanned documents (potentially in PDF format).
- Performs pre-processing tasks on the scanned images before OCR.
- Extracts text using Optical Character Recognition (OCR).
- Generates TEI XML files representing the document structure and content.
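The last feature, turning recognized text into TEI XML, can be illustrated with a minimal sketch. This is not the project's conversion code: the function name `build_tei`, the placeholder header texts, and the choice of one `<p>` per page are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title, pages):
    """Return a small TEI document as a string.

    title: document title placed in the teiHeader (hypothetical input).
    pages: list of strings, one OCR text per scanned page.
    """
    # Serialize TEI as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", TEI_NS)
    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    # Minimal header: fileDesc with titleStmt, publicationStmt, sourceDesc.
    header = ET.SubElement(tei, f"{{{TEI_NS}}}teiHeader")
    file_desc = ET.SubElement(header, f"{{{TEI_NS}}}fileDesc")
    title_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}titleStmt")
    ET.SubElement(title_stmt, f"{{{TEI_NS}}}title").text = title
    pub_stmt = ET.SubElement(file_desc, f"{{{TEI_NS}}}publicationStmt")
    ET.SubElement(pub_stmt, f"{{{TEI_NS}}}p").text = "Unpublished OCR transcription."
    source_desc = ET.SubElement(file_desc, f"{{{TEI_NS}}}sourceDesc")
    ET.SubElement(source_desc, f"{{{TEI_NS}}}p").text = "Scanned document."

    # Body: a page-break marker followed by the text of each page.
    text = ET.SubElement(tei, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    for number, page_text in enumerate(pages, start=1):
        ET.SubElement(body, f"{{{TEI_NS}}}pb", {"n": str(number)})
        ET.SubElement(body, f"{{{TEI_NS}}}p").text = page_text

    return ET.tostring(tei, encoding="unicode")


if __name__ == "__main__":
    print(build_tei("Sample document", ["Text of page one.", "Text of page two."]))
```

Extracted metadata such as author or date would go into further `teiHeader` elements; they are left out here to keep the sketch short.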
- Optical Character Recognition (OCR) Engine
- Python
- TEI XML Libraries
- Additional Libraries (Optional): Depending on the configuration, further libraries may be used for pre-processing (e.g., image manipulation), post-processing (e.g., data normalization), or other functionality.
- Operating System: Ubuntu or similar Linux distribution (consider WSL for Windows users: https://learn.microsoft.com/en-us/windows/wsl/)
- Python: Python 3.x (Download from https://www.python.org/downloads/)
- ocrd_all Package: Provides tools for document image processing and OCR (installation instructions: https://ocr-d.de/en/setup)
- Additional libraries might be required based on the specific configuration.
- For instructions tailored to your needs, see the OCR-D documentation, which details the various processing steps within the OCR-D framework.
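A quick way to confirm the requirements above is a small check script. The sketch below is an illustration rather than part of the repository: the function name `check_prerequisites` is hypothetical, and it assumes that a working OCR-D installation exposes the `ocrd` command on the `PATH`.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 0), tools=("ocrd",)):
    """Report whether the interpreter and the required CLI tools are available."""
    report = {"python_ok": sys.version_info[:2] >= tuple(min_python)}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If `ocrd` is reported as missing, the OCR-D setup instructions linked above are the place to start.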
- Clone the Repository: git clone https://github.com/VaishaliBurge29/Converting_Scanned_Documents_To_TEI_XML.git
- Project Structure:
- Data folder: This folder contains the scanned documents. Each document folder has its own images folder and processed files. You can organize the documents further using subfolders if needed.
- Images folder (Optional): If your pipeline involves working with intermediate image files during processing, this folder might store those temporary images.
- Output folder: This folder stores the final TEI XML output files generated by the pipeline.
- Prerequisites:
- Operating System: The pipeline is primarily developed for Ubuntu or similar Linux distributions. Windows users can run it through WSL.
- Python: Ensure Python 3.x is installed on your system.
This project doesn't require users to directly run a script. It's designed to process documents within the folders and generate TEI XML files in the output folder using the Python code provided.
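As a rough sketch of this folder-driven workflow, the snippet below maps each document folder to the TEI file that would be written for it. The function name `plan_outputs`, the folder name `output`, and the `<folder>.xml` naming scheme are assumptions for illustration and may not match the repository's actual layout.

```python
from pathlib import Path


def plan_outputs(data_dir, output_name="output"):
    """Map each document folder under data_dir to a TEI output path."""
    data_dir = Path(data_dir)
    output_dir = data_dir / output_name
    plan = {}
    for folder in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        if folder.name == output_name:
            continue  # skip the output folder itself
        # Hypothetical naming scheme: one TEI file per document folder.
        plan[folder.name] = output_dir / f"{folder.name}.xml"
    return plan


if __name__ == "__main__":
    data = Path("data")
    if data.is_dir():
        for name, target in plan_outputs(data).items():
            print(f"{name} -> {target}")
```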
- Data:
  - All 24 document folders are uploaded in the `data` folder.
  - Each document folder in `data` contains the processed pages produced by the processing steps of the processors.
  - Each document folder also has the split images.
- Output:
  - The output folder is located inside the `data` folder and includes:
    - Combined hOCR file.
    - TEI XML conversion Python code.
    - TEI XML output file.
- Code files:
  - In the `lib` folder you can find the Python code for splitting TIFF documents into individual documents, and
  - the code for combining hOCR files, which is included in the main branch.
- Documentation:
  - The `documentation` folder contains our Detailed Report.
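To give an idea of what combining per-page hOCR files involves, here is a minimal standard-library sketch. It is not the code from the `lib` folder: the function name `combine_hocr` is hypothetical, and it assumes every input is well-formed XHTML in which each page is a `<div class="ocr_page">` element.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def combine_hocr(page_documents):
    """Merge several single-page hOCR strings into one hOCR document.

    Assumes each input is well-formed XHTML whose pages are
    <div class="ocr_page"> elements.
    """
    # Serialize XHTML as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", XHTML_NS)
    html = ET.Element(f"{{{XHTML_NS}}}html")
    ET.SubElement(html, f"{{{XHTML_NS}}}head")
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")

    for document in page_documents:
        root = ET.fromstring(document)
        for div in root.iter(f"{{{XHTML_NS}}}div"):
            # Copy every page container, in input order, into the new body.
            if "ocr_page" in div.get("class", "").split():
                body.append(div)

    return ET.tostring(html, encoding="unicode")
```

The combined string can then be written to a single file and used as input for the TEI conversion step. Real hOCR output may contain HTML entities or markup that a strict XML parser rejects, in which case a more tolerant parser would be needed.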