feat(actor): Docling Actor on Apify infrastructure #875

Open: vancura wants to merge 41 commits into main

@vancura vancura commented Feb 3, 2025

Dear Docling maintainers,

I have wrapped Docling as an Apify Actor by adding the Actor definition in the .actor directory and published the Docling Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the “Run on Apify” button.

For the full description of the Actor, please see the README file in the .actor directory.

Docling can now be used in the cloud without installation, free of charge. Users can avoid managing Python, OCR libraries, and ML model dependencies locally. The Actor can be run from Apify Console, through the API, or locally with the CLI:

apify call vancura/docling -i '{
    "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
    "outputFormat": "json",
    "ocr": true
}'

The Actor processes documents and stores the results in Apify's key-value store under the OUTPUT key. It supports multiple output formats:

  • Markdown
  • JSON
  • HTML
  • Plain text
  • Doctags (structured format)
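
For the API route mentioned above, here is a minimal Python sketch that uses only the standard library. It assumes the public Apify API v2 endpoints for starting a run and for reading a key-value store record; the token is a placeholder and the helper names `build_run_request` and `output_record_url` are hypothetical. The request is only built here, not sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"  # Apify API v2 base URL


def build_run_request(token: str, actor: str, actor_input: dict) -> Request:
    """Build (but do not send) a POST request that starts an Actor run."""
    # The API addresses Actors as "username~name" rather than "username/name".
    actor_id = actor.replace("/", "~")
    return Request(
        f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def output_record_url(store_id: str, key: str = "OUTPUT") -> str:
    """URL of a record in a run's default key-value store."""
    return f"{API_BASE}/key-value-stores/{quote(store_id)}/records/{quote(key)}"


request = build_run_request(
    "<APIFY_TOKEN>",  # placeholder, not a real token
    "vancura/docling",
    {
        "documentUrl": "https://arxiv.org/pdf/2408.09869.pdf",
        "outputFormat": "json",
        "ocr": True,
    },
)
```

Passing `request` to `urllib.request.urlopen` would start the run. As far as I understand the API, the returned run object carries a `defaultKeyValueStoreId`, which can be handed to `output_record_url` to locate the OUTPUT record once the run has finished.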

Technical implementation

The Actor provides:

  • Cloud-based document processing through Apify's infrastructure
  • API access for easy integration
  • Support for multiple output formats
  • OCR capabilities for scanned documents
  • Integration potential with other Apify Actors
  • Clean error handling and input validation
  • Comprehensive output handling:
    • Processed documents in key-value store
    • Detailed processing logs
    • Dataset records with result URLs and status

The Actor uses the official quay.io/ds4sd/docling-serve-cpu Docker image (~4GB) with all necessary dependencies:

  • The docling-serve REST API for document processing
  • OCR libraries and ML models
  • Node.js 20.x for Apify CLI integration
  • Multi-stage Docker build for optimized size
  • All required system binaries

Note: The first Actor run may take 1-2 minutes to start as the container initializes. This is normal behavior, and users shouldn't terminate the run prematurely.

Apify will sponsor your project

All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id docling in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept the pull request and ensure your GitHub Sponsor button is set up.

You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to ds4sd/docling, and you’ll see it under your Apify account.

To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!

Benefits of the Actor Programming Model

The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation scripts that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.

Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 3,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Docling accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations that can create much more powerful workflows than just individual parts.

Full disclosure

I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.

If you have any questions or need assistance, don’t hesitate to reach out to me (@vancura) or @netmilk, the Apify VP of DX, or just write to us at [email protected].

mergify bot commented Feb 3, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
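
The title rule can be checked locally before pushing. This is a small sketch that applies the same pattern with Python's `re` module; Mergify's own matching may differ in detail, and `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


# The original PR title fails the rule; the retitled one passes.
old_ok = is_conventional("Docling Actor on Apify infrastructure")
new_ok = is_conventional("feat(actor): Docling Actor on Apify infrastructure")
```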

@vancura vancura changed the title Docling Actor on Apify infrastructure feat(actor): Docling Actor on Apify infrastructure Feb 3, 2025
@vancura vancura marked this pull request as ready for review February 5, 2025 20:50
@vancura vancura force-pushed the main branch 2 times, most recently from d7b9e41 to f1ebb31 Compare February 6, 2025 10:18
@PeterStaar-IBM
Contributor

@vancura We really love this PR, but one question we have is whether we can synchronize the APIs of https://github.com/DS4SD/docling-serve with the API you will put in place.

@archasek

@vancura it doesn't support gpu, does it? what about supporting multipart/form-data next to pdf url?

@dolfim-ibm
Contributor

> @vancura it doesn't support gpu, does it? what about supporting multipart/form-data next to pdf url?

Yes, it does support GPU, and multipart/form-data is supported.

The key questions/proposals for us are:

  1. Let's align the input format of the APIs, so that users can easily switch between the systems.
  2. Would it make sense to run the docling-serve image directly in Apify, or is the usual approach to wrap it as you did?

@vancura
Author

vancura commented Feb 28, 2025

> We really love this PR, but one question we have is if we can synchronize the API's from DS4SD/docling-serve with the API you will put in place.

I'd be happy to align with the docling-serve API rather than create a parallel implementation; adapting the Actor to leverage docling-serve directly would make more sense. This would involve:

  1. rewriting the Actor to use docling-serve as the underlying engine instead of calling the Docling CLI directly;
  2. ensuring the Actor's input schema matches docling-serve's API parameters exactly;
  3. adding an adapter layer to connect docling-serve's outputs with Apify's storage system.

This approach would maintain a consistent API across the entire Docling ecosystem while allowing users to benefit from the serverless deployment on Apify. The docling-serve system already has a well-designed API structure and robust error handling which we can leverage, rather than maintaining two parallel implementations.
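
The schema alignment in step 2 could start from a small translation function. The sketch below maps the Actor input shown in the PR description (`documentUrl`, `outputFormat`, `ocr`) onto the `http_sources` and `options` fields of the docling-serve `POST /v1alpha/convert/source` body discussed in this thread. The alias table, the defaults, and the function name are assumptions for illustration, not the Actor's actual code.

```python
# Assumed mapping from Actor-side format names to docling-serve `to_formats`
# values. Only "json" is confirmed as an Actor value by the PR description.
FORMAT_ALIASES = {
    "markdown": "md",
    "md": "md",
    "json": "json",
    "html": "html",
    "text": "text",
    "doctags": "doctags",
}


def to_docling_serve_request(actor_input: dict) -> dict:
    """Translate the legacy Actor input into a docling-serve request body."""
    url = actor_input.get("documentUrl")
    if not url:
        raise ValueError("documentUrl is required")
    # Defaulting to Markdown with OCR enabled is an assumption of this sketch.
    fmt = str(actor_input.get("outputFormat", "md")).lower()
    if fmt not in FORMAT_ALIASES:
        raise ValueError(f"unsupported outputFormat: {fmt!r}")
    return {
        "http_sources": [{"url": url}],
        "options": {
            "to_formats": [FORMAT_ALIASES[fmt]],
            "do_ocr": bool(actor_input.get("ocr", True)),
        },
    }
```

Keeping the translation in one pure function makes it easy to delete later if the Actor adopts the docling-serve schema as its only input format.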

> It doesn't support gpu, does it?

The current Actor implementation runs on Apify's infrastructure, which doesn't support GPUs. It's optimized for CPU-based processing, though Docling can leverage GPUs when available (just not in the Apify case).

> What about supporting multipart/form-data next to pdf url?

The Actor currently accepts document URLs but could be extended to support direct file uploads. This would require:

  • modifying the input schema to accept base64-encoded files or implementing a temporary storage solution;
  • updating the processing script to handle the uploaded files;
  • adding a file upload interface in the Actor's web UI.

I can implement these changes if this functionality would be valuable to users.
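
To make the base64 option concrete, here is a rough sketch of how a file could travel inside a JSON input object and be recovered on the Actor side. The field names `fileName` and `fileBase64` are hypothetical and not part of the current input schema.

```python
import base64
import json


def encode_file_input(filename: str, content: bytes) -> dict:
    """Wrap raw file bytes as a JSON-safe input object (hypothetical schema)."""
    return {
        "fileName": filename,
        "fileBase64": base64.b64encode(content).decode("ascii"),
    }


def decode_file_input(actor_input: dict) -> bytes:
    """Recover the original bytes; reject input that is not valid base64."""
    return base64.b64decode(actor_input["fileBase64"], validate=True)


payload = encode_file_input("sample.pdf", b"%PDF-1.7 example bytes")
serialized = json.dumps(payload)  # what would travel as Actor input
restored = decode_file_input(json.loads(serialized))
```

Base64 inflates the payload by roughly a third, so for large documents the temporary storage option mentioned above may be the better fit.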

> Let's align the input format of the APIs, so that users can easily switch between the systems.

I agree that aligning the input formats would provide consistency for users. After examining docling-serve, I recommend:

  • API Alignment: I can update the Actor's input schema to match docling-serve's API structure, supporting the same parameters and options. This would create a consistent experience regardless of which system users choose.
  • Container Strategy: While running docling-serve directly on Apify is possible, the wrapper approach I implemented offers several advantages:
    • better integration with Apify's platform features (key-value stores, datasets, webhooks);
    • optimized for Apify's infrastructure constraints;
    • simpler monitoring and logging specific to the Apify environment;
    • easier maintenance and updates.

That said, if you prefer, we could create a hybrid approach where we:

  1. use docling-serve as the base container;
  2. add a thin adapter layer to connect it with Apify's platform;
  3. maintain API compatibility between both systems.

This would allow users to use either system while ensuring consistent behavior and output formats.
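
The thin adapter layer in step 2 might look roughly like the sketch below, which turns a conversion response into a record ready for the key-value store. The response field names (`document`, `md_content`, `json_content`, and so on) reflect my reading of the docling-serve v1alpha response and should be treated as assumptions, as should the chosen content types.

```python
import json

# Assumed mapping: output format -> (response field, content type).
FORMAT_FIELDS = {
    "md": ("md_content", "text/markdown; charset=utf-8"),
    "json": ("json_content", "application/json; charset=utf-8"),
    "html": ("html_content", "text/html; charset=utf-8"),
    "text": ("text_content", "text/plain; charset=utf-8"),
    "doctags": ("doctags_content", "text/plain; charset=utf-8"),
}


def to_kvs_record(response: dict, fmt: str) -> dict:
    """Pick the requested format out of a response and describe the KVS record."""
    field, content_type = FORMAT_FIELDS[fmt]
    value = (response.get("document") or {}).get(field)
    if value is None:
        raise KeyError(f"response has no {field}")
    if not isinstance(value, str):
        value = json.dumps(value)  # JSON output may arrive as an object
    return {"key": "OUTPUT", "value": value, "content_type": content_type}
```

The returned dictionary is deliberately storage-agnostic: the same record could be written with the Apify CLI, the SDK, or a plain HTTP call.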

I will take a look at these improvements as soon as I can, hopefully next week.

@netmilk

netmilk commented Mar 4, 2025

> We really love this PR, ...

Hi @PeterStaar-IBM, @archasek, @dolfim-ibm,

It makes me very happy to see how supportive you are of our work. Thank you for that!

@vancura is no longer with Apify full-time but is still able to help. Due to that, we have temporarily limited availability to work on this. I’ll personally do as much as I can to help you get this PR accepted so it doesn’t get abandoned.

If you eventually accept the PR, we would like to communicate it through our marketing channels to prove the concept internally and see whether there’s any traction in adoption. I’m happy to work on further refactoring and get Apify engineers involved once the concept is proven. If you have any other ideas on how to co-market, just let me know, I'm open to any sort of collaboration.

What would be the minimal increment that would allow the PR to go through? Is it the Apify Input Object <> Docling Serve API interoperability?

@vancura
Author

vancura commented Mar 4, 2025

Hi, I just want to say I will work on this PR later this week! I am not going anywhere, no worries :)

@PeterStaar-IBM
Contributor

@netmilk @vancura Good, let's target merging this by March 12th at the latest.

@netmilk

netmilk commented Mar 5, 2025

@PeterStaar-IBM What would be the minimal increment that would allow the PR to go through, so we can prioritize? Is it the Apify Input Object <> Docling Serve API interoperability?

@dolfim-ibm
Contributor

@netmilk yes, let's align the input/output.

@vancura
Author

vancura commented Mar 9, 2025

I've completed all the requested changes to the Docling Actor. Switching from the full Docling CLI to the more efficient docling-serve API significantly improved the Actor.

Major improvements since commit df8226f:

  1. Switched to docling-serve API: Now using the official quay.io/ds4sd/docling-serve-cpu Docker image instead of custom installation
  2. Reduced Docker image size: From ~6GB to ~4GB, improving download speed and resource usage
  3. Improved API compatibility: Updated endpoints and payload structure to match docling-serve API format
  4. Enhanced response handling: Added a dedicated Python processor script for reliable API communication
  5. Multi-stage Docker build: More efficient container with only necessary dependencies
  6. Better error handling: Improved error detection, reporting, and recovery
  7. Enhanced startup health checks: Ensures the API is fully functional before processing

These improvements make the Actor more reliable, efficient, and maintainable. The Actor is live on Apify at https://apify.com/vancura/docling and fully functional.
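
A startup health check of the kind listed in item 7 usually boils down to polling until the server answers. The following is a generic sketch, not the Actor's actual script: `probe` is any callable that returns True once the API responds, and the 120-second default simply mirrors the 1-2 minute startup time mentioned in the PR description.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the server is still starting
        if clock() >= deadline:
            return False
        sleep(interval)
```

In practice `probe` could wrap an HTTP GET against whichever health or docs endpoint the server exposes. Injecting `sleep` and `clock` keeps the loop testable without real waiting.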

Please let me know if you'd like any further adjustments before merging!

vancura added 15 commits March 9, 2025 16:26
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.

Signed-off-by: Václav Vančura <[email protected]>
- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.

Signed-off-by: Václav Vančura <[email protected]>
Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments.

Signed-off-by: Václav Vančura <[email protected]>
Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files.

Signed-off-by: Václav Vančura <[email protected]>
- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.

Signed-off-by: Václav Vančura <[email protected]>
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:

- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.

Signed-off-by: Václav Vančura <[email protected]>
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup

Signed-off-by: Václav Vančura <[email protected]>
vancura added 22 commits March 9, 2025 16:26
- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.

Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code

Signed-off-by: Václav Vančura <[email protected]>
- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document

Signed-off-by: Václav Vančura <[email protected]>
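
For readers integrating with the Actor, the error codes above can be modeled as an enumeration. This is an illustrative Python sketch only; the Actor itself is a shell script, and the record layout produced by `error_record` is hypothetical.

```python
from enum import Enum
from typing import Optional


class ActorError(Enum):
    """Error codes from the commit above, with their described meaning."""

    ERR_INVALID_INPUT = "Document URL is missing"
    ERR_URL_INACCESSIBLE = "Document URL is inaccessible"
    ERR_DOCLING_FAILED = "Docling command failed"
    ERR_OUTPUT_MISSING = "Output file is missing or empty"
    ERR_STORAGE_FAILED = "Storing the output document failed"


def error_record(code: ActorError, document_url: Optional[str] = None) -> dict:
    """Build a dataset-style record describing a failed run (assumed layout)."""
    return {
        "status": "error",
        "error": code.name,
        "message": code.value,
        "documentUrl": document_url,
    }
```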
- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.

Signed-off-by: Václav Vančura <[email protected]>
Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored.

Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:

- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <[email protected]>
This commit completely revamps the Actor implementation with two major improvements:

1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing

2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution

3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to suppress update checks
   * Support --no-update-notifier CLI flag when available

4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification

This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.

Signed-off-by: Václav Vančura <[email protected]>
Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API.

Signed-off-by: Václav Vančura <[email protected]>
@vancura
Author

vancura commented Mar 9, 2025

(Sorry, these noisy commits above are here to make DCO happy).

@netmilk

netmilk commented Mar 11, 2025

Thank you @vancura, I've validated it and it works magic. It's 10x to 40x more effective.

@PeterStaar-IBM @dolfim-ibm, would you mind, please, indicating whether there are any outstanding issues that might be a blocker for the PR to be merged? I'm happy to help with anything.

@dolfim-ibm
Contributor

I see that in the current implementation you switched to calling docling-serve internally but still use the new custom input schema format that you defined. This is not really what we asked for in our comment.

The request was: let's expose only one input schema to the user.

Using docling-serve was only a suggestion, in case you planned to expose it directly. If you have to wrap it anyway, it might just introduce extra iterations.

@netmilk

netmilk commented Mar 11, 2025

Thank you for the quick info. I think that it might be quite challenging to make the HTTP/REST design pattern compatible with the Web Actor Programming Model.

Just to make sure: you're asking us to convert the Actor input schema (.actor/input_schema.json) in this PR to the structure of the POST /v1alpha/convert/source request body JSON schema as defined in the Docling openapi.json.

So the intention is to make the input object from the docling-serve curl example compatible with the apify call input object:

$ echo '{
  "options": {
    "from_formats": [
      "docx",
      "pptx",
      "html",
      "image",
      "pdf",
      "asciidoc",
      "md",
      "xlsx"
    ],
    "to_formats": ["md", "json", "html", "text", "doctags"],
    "image_export_mode": "placeholder",
    "do_ocr": true,
    "force_ocr": false,
    "ocr_engine": "easyocr",
    "ocr_lang": [
      "fr",
      "de",
      "es",
      "en"
    ],
    "pdf_backend": "dlparse_v2",
    "table_mode": "fast",
    "abort_on_error": false,
    "return_as_file": false,
    "do_table_structure": true,
    "include_images": true,
    "images_scale": 2
  },
  "http_sources": [{"url": "https://arxiv.org/pdf/2206.01062"}]
}' > input.json

$ cat input.json | apify call vancura/docling

The output of the Actor is always going to be a list of links to files saved in the Actor object storage.

@dolfim-ibm Please confirm this is what you desire for us to do or explain where I misunderstood your requirements.

@vancura if you have a better idea for output compatibility, please suggest it.

@dolfim-ibm
Contributor

Our point is to reduce the number of different input schemas that users have to deal with. Having the same API that we will promote in docling-serve should also increase adoption of Apify.

Also note that in the payload posted above, 95% of the arguments are optional. Apify could simply rely on their defaults to simplify it.
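
Relying on defaults could be sketched as follows: the request body carries only the source URL plus whatever options the caller sets explicitly, leaving everything else to the server. The option names are taken from the payload posted earlier in the thread; that the server fills in defaults for omitted options is assumed from the comment above, and `minimal_convert_request` is a hypothetical helper.

```python
# Option names as they appear in the example payload in this thread.
KNOWN_OPTIONS = {
    "from_formats", "to_formats", "image_export_mode", "do_ocr", "force_ocr",
    "ocr_engine", "ocr_lang", "pdf_backend", "table_mode", "abort_on_error",
    "return_as_file", "do_table_structure", "include_images", "images_scale",
}


def minimal_convert_request(url: str, **overrides) -> dict:
    """Smallest request body: one source plus only explicitly set options."""
    unknown = set(overrides) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unknown options: {sorted(unknown)}")
    body = {"http_sources": [{"url": url}]}
    if overrides:
        body["options"] = dict(overrides)  # everything else is left to defaults
    return body


simple = minimal_convert_request("https://arxiv.org/pdf/2206.01062")
markdown_only = minimal_convert_request(
    "https://arxiv.org/pdf/2206.01062", to_formats=["md"], do_ocr=False
)
```

With this shape, the common case is a one-field input object, which fits the Actor input model while staying a strict subset of the docling-serve schema.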

6 participants