Docling Actor on Apify

This Actor (specification v1) wraps the Docling project to provide serverless document processing in the cloud. It can process complex documents (PDF, DOCX, images) and convert them into structured formats (Markdown, JSON, HTML, Text, or DocTags) with optional OCR support.

What are Actors?

Actors are serverless microservices running on the Apify Platform. They are based on the Actor SDK and can be found in the Apify Store. Learn more about Actors in the Apify Whitepaper.

Features

Runs Docling v2.17.0 in a fully managed environment on Apify
Processes multiple document formats:
- PDF documents (scanned or digital)
- Microsoft Office files (DOCX, XLSX, PPTX)
- Images (PNG, JPG, TIFF)
- Other text-based formats
Provides OCR capabilities for scanned documents
Exports to multiple formats:
- Markdown
- JSON
- HTML
- Plain Text
- DocTags (structured format)
No local setup needed: provide input via a simple JSON config

Usage

Using Apify Console

Go to the Apify Actor page.
Click "Run".
In the input form, fill in:
- The URL of the document.
- Output format (md, json, html, text, or doctags).
- Whether to apply OCR (boolean toggle).
The Actor will run and produce its outputs in the default key-value store under the key OUTPUT_RESULT.

Using Apify API

curl --request POST \
  --url "https://api.apify.com/v2/acts/username~actorname/runs" \
  --header 'Content-Type: application/json' \
  --header 'Authorization: Bearer YOUR_API_TOKEN' \
  --data '{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "md",
    "ocr": true
  }'
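
The same request can be assembled with Python's standard library. This is a minimal sketch, not part of the Actor: `build_run_request` is a hypothetical helper, `username~actorname` and `YOUR_API_TOKEN` are the placeholders from the curl example, and the `/runs` path is assumed from the Apify API v2 endpoint for starting a run. The function only builds the request, so it can be inspected before anything is sent.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that starts an Actor run."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request(
        "username~actorname",  # placeholder Actor ID
        "YOUR_API_TOKEN",      # placeholder token
        {
            "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
            "outputFormat": "md",
            "ocr": True,
        },
    )
    # Sending needs a real token and Actor ID, so it is left commented out:
    # with urllib.request.urlopen(request) as response:
    #     run = json.load(response)["data"]
    print(request.get_method(), request.full_url)
```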

Using Apify CLI

apify call username/actorname --input='{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "md",
    "ocr": true
}'

Input Parameters

The Actor accepts JSON input that conforms to the schema in .actor/input_schema.json. Below is a summary of the fields:

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `documentUrl` | string | Yes | None | URL of the document (PDF, image, DOCX, etc.) to be processed. Must be directly accessible via public URL. |
| `outputFormat` | string | No | `md` | Desired output format. One of `md`, `json`, `html`, `text`, or `doctags`. |
| `ocr` | boolean | No | `true` | If set to true, OCR will be applied to scanned PDFs or images for text recognition. |
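
Checking input on the client side before starting a run can save a failed run. The sketch below applies the defaults and allowed values from the table; `normalize_input` is a hypothetical helper, and the restriction to `http`/`https` URLs is an assumption drawn from the "public URL" requirement, not something taken from the Actor's own validation code.

```python
from urllib.parse import urlparse

ALLOWED_FORMATS = ("md", "json", "html", "text", "doctags")
DEFAULTS = {"outputFormat": "md", "ocr": True}


def normalize_input(raw: dict) -> dict:
    """Apply the documented defaults and reject values the table rules out."""
    merged = {**DEFAULTS, **raw}

    url = merged.get("documentUrl")
    if not isinstance(url, str) or not url:
        raise ValueError("documentUrl is required and must be a non-empty string")
    parsed = urlparse(url)
    # Assumption: a publicly accessible document is served over http(s).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"documentUrl must be an http(s) URL, got {url!r}")

    if merged["outputFormat"] not in ALLOWED_FORMATS:
        raise ValueError(f"outputFormat must be one of {ALLOWED_FORMATS}")
    if not isinstance(merged["ocr"], bool):
        raise ValueError("ocr must be a boolean")
    return merged
```

Calling `normalize_input({"documentUrl": "https://arxiv.org/pdf/2408.09869.pdf"})` returns the same input with `outputFormat` set to `md` and `ocr` set to `True`.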

Example Input

{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "md",
    "ocr": false
}

Output

The Actor provides three types of outputs:

Processed Document - The Actor prints the direct URL of your result in the run log, in this form:

You can find your results at: 'https://api.apify.com/v2/key-value-stores/[YOUR_STORE_ID]/records/OUTPUT_RESULT'

Processing Log - Available in the key-value store as DOCLING_LOG
Dataset Record - Contains processing metadata with:
- Input document URL
- Direct link to the processed output
- Processing status

You can access the results in several ways:

Direct URL (shown in Actor run logs):

https://api.apify.com/v2/key-value-stores/[STORE_ID]/records/OUTPUT_RESULT

Programmatically via Apify CLI:

apify key-value-stores get-value [STORE_ID] OUTPUT_RESULT

Dataset - Check the "Dataset" tab in the Actor run details to see processing metadata
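
The direct URL can also be fetched from a script. The following is a minimal sketch using only the standard library; `record_url` and `fetch_record` are hypothetical helpers, the store ID is whatever your run log reports, and the optional bearer token is an assumption for stores that are not publicly readable.

```python
import urllib.request
from typing import Optional
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def record_url(store_id: str, key: str = "OUTPUT_RESULT") -> str:
    """Return the key-value store record URL in the form shown in the run log."""
    return (
        f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}"
        f"/records/{quote(key, safe='')}"
    )


def fetch_record(store_id: str, key: str = "OUTPUT_RESULT", token: Optional[str] = None) -> bytes:
    """Download a record; pass a token only if the store requires one."""
    request = urllib.request.Request(record_url(store_id, key))
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return response.read()
```

The same helper retrieves the processing log when called with `key="DOCLING_LOG"`.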

Example Outputs

Markdown (md)

# Document Title

## Section 1
Content of section 1...

## Section 2
Content of section 2...

JSON

{
    "title": "Document Title",
    "sections": [
        {
            "level": 1,
            "title": "Section 1",
            "content": "Content of section 1..."
        }
    ]
}

HTML

<h1>Document Title</h1>
<h2>Section 1</h2>
<p>Content of section 1...</p>

Processing Logs (`DOCLING_LOG`)

The Actor maintains detailed processing logs including:

Memory usage statistics
Processing steps and timing
Error messages and stack traces
Input validation results
OCR processing details (when enabled)

Access logs via:

apify key-value-stores get-value [STORE_ID] DOCLING_LOG

Performance & Resources

Docker Image Size: ~6 GB (includes OCR libraries and ML models)
Memory Requirements:
- Minimum: 4 GB RAM
- Recommended: 8 GB RAM for large documents
Memory Monitoring:
- Real-time memory usage tracking during processing
- Detailed memory statistics in DOCLING_LOG
- Automatic failure detection for out-of-memory situations
Processing Time:
- Simple documents: 30-60 seconds
- Complex PDFs with OCR: 2-5 minutes
- Large documents (100+ pages): 5-15 minutes
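
When starting runs through the API, the memory guidance above can be turned into request parameters. This sketch assumes the run endpoint accepts `memory` (in megabytes) and `timeout` (in seconds) as query parameters, as described in the Apify API reference. The helpers `pick_memory_mb` and `run_url` are hypothetical, and the 900-second default simply mirrors the upper end of the processing times listed above.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def pick_memory_mb(large_document: bool) -> int:
    """Map the guidance above to a value: 4 GB minimum, 8 GB for large documents."""
    return 8192 if large_document else 4096


def run_url(actor_id: str, large_document: bool = False, timeout_secs: int = 900) -> str:
    """Build the run-start URL with memory and timeout query parameters."""
    query = urlencode({
        "memory": pick_memory_mb(large_document),
        "timeout": timeout_secs,
    })
    return f"{API_BASE}/acts/{actor_id}/runs?{query}"
```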

Troubleshooting

Common issues and solutions:

Document URL Not Accessible
- Ensure the URL is publicly accessible
- Check if the document requires authentication
- Verify the URL leads directly to the document
OCR Processing Fails
- Verify the document is not password-protected
- Check if the image quality is sufficient
- Try processing with OCR disabled
Memory Issues
- For large documents, try splitting them into smaller chunks
- Consider running the Actor with a higher memory allocation
- Disable OCR if not strictly necessary
Output Format Issues
- Verify the output format is supported
- Check if the document structure is compatible
- Review the DOCLING_LOG for specific errors
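
The URL checks in the first item can be approximated before a run is started. The sketch below sends a `HEAD` request and maps the response to one of the hints above. Both `diagnose` and `preflight` are hypothetical helpers, and the HTML check is a heuristic only: some servers reject `HEAD`, and an HTML response may be a legitimate document rather than a landing page.

```python
import urllib.error
import urllib.request


def diagnose(status: int, content_type: str) -> str:
    """Turn an HTTP status and Content-Type into a troubleshooting hint."""
    if status in (401, 403):
        return "requires authentication"
    if status >= 400:
        return f"not accessible (HTTP {status})"
    if content_type.split(";")[0].strip().lower() == "text/html":
        return "returns an HTML page; check that the URL leads directly to the document"
    return "ok"


def preflight(url: str, timeout: float = 10.0) -> str:
    """Send a HEAD request and diagnose the response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return diagnose(response.status, response.headers.get("Content-Type", ""))
    except urllib.error.HTTPError as error:
        # HTTPError carries the status code and headers of the failed response.
        return diagnose(error.code, (error.headers or {}).get("Content-Type", ""))
    except urllib.error.URLError as error:
        return f"not reachable ({error.reason})"
```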

Error Handling

The Actor implements comprehensive error handling:

Input validation for document URLs and parameters
Detailed error messages in DOCLING_LOG
Proper exit codes for different failure scenarios
Memory monitoring and out-of-memory detection
Automatic cleanup on failure
Dataset records with processing status
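
From the client side, a failure can also be detected from the run object that the Apify API returns. The sketch below is an assumption-laden illustration rather than the Actor's own logic: `summarize_run` is a hypothetical helper, the status names are the terminal run statuses documented by Apify, and the `exitCode` and `defaultKeyValueStoreId` fields are assumed to be present on the run object.

```python
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}


def summarize_run(run: dict) -> str:
    """Describe a run object (the `data` field of a run response) in one line."""
    status = run.get("status", "UNKNOWN")
    if status not in TERMINAL_STATUSES:
        return f"run is not finished yet (status {status})"
    if status == "SUCCEEDED":
        return "run succeeded"
    # For failed runs, point at the log the Actor writes to its key-value store.
    store_id = run.get("defaultKeyValueStoreId", "[STORE_ID]")
    return (
        f"run ended with status {status}, exit code {run.get('exitCode')}; "
        f"see DOCLING_LOG in key-value store {store_id}"
    )
```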

Local Development

If you wish to develop or modify this Actor locally:

Clone the repository.
Ensure Docker is installed.
The Actor files are located in the .actor directory:
- Dockerfile - Defines the container environment
- actor.json - Actor configuration and metadata
- actor.sh - Main execution script
- input_schema.json - Input parameter definitions
- .dockerignore - Build optimization rules
Run the Actor locally using:
```
apify run
```

Actor Structure

.actor/
├── Dockerfile          # Container definition
├── actor.json          # Actor metadata
├── actor.sh            # Execution script
├── input_schema.json   # Input parameters
├── .dockerignore       # Build exclusions
└── README.md           # This documentation

Requirements & Installation

An Apify account (free tier available)
For local development:
- Docker installed
- Apify CLI (npm install -g apify-cli)
- Git for version control
The Actor's Docker image (~6 GB) includes:
- Python 3.11 with optimized caching (.pyc, .pyo excluded)
- Node.js 20.x
- Docling v2.17.0 and its dependencies
- OCR libraries and ML models

Build Optimizations

The Actor uses several optimizations to maintain efficiency:

Python cache files (`__pycache__`, .pyc, .pyo, .pyd) are excluded
Development artifacts (.git, .env, .venv) are ignored
Log and test files (*.log, .pytest_cache, .coverage) are excluded from builds

License

This wrapper project is licensed under the MIT License, matching the original Docling license. See LICENSE for details.

Acknowledgments

Docling codebase by IBM
Apify for the serverless actor environment

Security Considerations

The Actor runs as a non-root user (appuser) for enhanced security
Input URLs are validated before processing
Temporary files are securely managed and cleaned up
Process isolation through Docker containerization
Secure handling of processing artifacts
