feat(actor): Docling Actor on Apify infrastructure #875
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
@vancura We really love this PR, but one question we have is whether we can synchronize the APIs from https://github.com/DS4SD/docling-serve with the API you will put in place.
@vancura It doesn't support GPU, does it? What about supporting multipart/form-data uploads alongside PDF URLs?
yes, it does support GPU and The key questions/proposals for us are:
|
I'd be happy to align with the docling-serve API rather than create a parallel implementation; adapting the Actor to leverage docling-serve directly would make more sense. This would involve:
This approach would maintain a consistent API across the entire Docling ecosystem while allowing users to benefit from the serverless deployment on Apify. The docling-serve system already has a well-designed API structure and robust error handling which we can leverage, rather than maintaining two parallel implementations.
The current Actor implementation runs on Apify's infrastructure, which doesn't support GPUs. It's optimized for CPU-based processing, though Docling can leverage GPUs when available (just not in the Apify case).
The Actor currently accepts document URLs but could be extended to support direct file uploads. This would require:
I can implement these changes if this functionality would be valuable to users.
I agree that aligning the input formats would provide consistency for users. After examining docling-serve, I recommend:
That said, if you prefer, we could create a hybrid approach where we:
This would allow users to use either system while ensuring consistent behavior and output formats. I will take a look at these improvements as soon as I can, hopefully next week. |
Hi @PeterStaar-IBM, @archasek, @dolfim-ibm, It makes me very happy to see how supportive you are of our work. Thank you for that! @vancura is no longer with Apify full-time but is still able to help. Due to that, we have temporarily limited availability to work on this. I’ll personally do as much as I can to help you get this PR accepted so it doesn’t get abandoned. If you eventually accept the PR, we would like to communicate it through our marketing channels to prove the concept internally and see whether there’s any traction in adoption. I’m happy to work on further refactoring and get Apify engineers involved once the concept is proven. If you have any other ideas on how to co-market, just let me know, I'm open to any sort of collaboration. What would be the minimal increment that would allow the PR to go through? Is it the Apify Input Object <> Docling Serve API interoperability? |
Hi, I just want to say I will work on this PR later this week! I am not going anywhere, no worries :) |
@PeterStaar-IBM What would be the minimal increment that would allow the PR to go through, so we can prioritize? Is it the Apify Input Object <> Docling Serve API interoperability? |
@netmilk yes, let's align the input/output. |
I've completed all the requested changes to the Docling Actor. Switching from the full Docling CLI to the more efficient docling-serve API significantly improved the Actor. Major improvements since commit df8226f:
These improvements make the Actor more reliable, efficient, and maintainable. The Actor is live on Apify at https://apify.com/vancura/docling and fully functional. Please let me know if you'd like any further adjustments before merging! |
Signed-off-by: Václav Vančura <[email protected]>
- Set proper ownership and permissions for runtime directory.
- Switch to non-root user for enhanced security.
- Use `--chown` flag in COPY commands to maintain correct file ownership.
- Ensure all files and directories are owned by `appuser`.
Signed-off-by: Václav Vančura <[email protected]>
- Combine RUN commands to reduce image layers and overall size.
- Add non-root user `appuser` for improved security.
- Use `--no-install-recommends` flag to minimize installed packages.
- Install only necessary dependencies in a single RUN command.
- Maintain proper cleanup of package lists and caches.
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Upgrade pip and npm to latest versions, pin docling to 2.15.1 and apify-cli to 2.7.1 for better stability and reproducibility. This change helps prevent unexpected behavior from dependency updates and ensures consistent builds across environments. Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Add and configure `/home/appuser/.apify` directory with proper permissions for the appuser in the Docker container. This ensures the Apify SDK has a writable home directory for storing its configuration and temporary files. Signed-off-by: Václav Vančura <[email protected]>
- Add `ACTOR_PATH_IN_DOCKER_CONTEXT` argument to ignore the Apify-tooling related warning.
- Improve readability with consistent formatting and spacing in RUN commands.
- Enhance security by properly setting up appuser home directory and permissions.
- Streamline directory structure and ownership for runtime operations.
- Remove redundant `.apify` directory creation as it's handled by the CLI.
Signed-off-by: Václav Vančura <[email protected]>
The shell script has been enhanced with better error handling, input validation, and cleanup procedures. Key improvements include:
- Added proper quoting around variables to prevent word splitting.
- Improved error messages and logging functionality.
- Implemented a cleanup trap to ensure temporary files are removed.
- Enhanced validation of input parameters and output formats.
- Added better handling of the log file and its storage.
- Improved command execution with proper evaluation.
- Added comments for better code readability and maintenance.
- Fixed potential security issues with proper variable expansion.
Signed-off-by: Václav Vančura <[email protected]>
- Initialize log file at `/tmp/docling.log` and redirect all output to it
- Remove exit on error trap, now only logs error line numbers
- Use temporary directory for timestamp file
- Capture Docling exit code and handle errors more gracefully
- Update log file references to use `LOG_FILE` variable
- Remove local log file during cleanup
Signed-off-by: Václav Vančura <[email protected]>
- Add installation of `time` and `procps` packages for better resource monitoring.
- Set environment variables `PYTHONUNBUFFERED`, `MALLOC_ARENA_MAX`, and `EASYOCR_DOWNLOAD_CACHE` for improved performance.
- Create a cache directory for EasyOCR to optimize storage usage.
Signed-off-by: Václav Vančura <[email protected]>
Removing the dollar signs due to what we discovered at https://cirosantilli.com/markdown-style-guide/#dollar-signs-in-shell-code Signed-off-by: Václav Vančura <[email protected]>
- `ERR_INVALID_INPUT` for missing document URL
- `ERR_URL_INACCESSIBLE` for inaccessible URLs
- `ERR_DOCLING_FAILED` for Docling command failures
- `ERR_OUTPUT_MISSING` for missing or empty output files
- `ERR_STORAGE_FAILED` for failures in storing the output document
Signed-off-by: Václav Vančura <[email protected]>
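For readers who want to see how the error codes in this commit might be represented in a structured way, here is a minimal Python sketch. The Actor itself is a shell script, so this is only an illustration: the `ActorError` enum, the `error_record()` helper, and the record fields (`status`, `error`, `message`) are hypothetical and not taken from the PR.

```python
from enum import Enum


class ActorError(str, Enum):
    """Error codes named in the commit message; the enum itself is hypothetical."""

    INVALID_INPUT = "ERR_INVALID_INPUT"        # missing document URL
    URL_INACCESSIBLE = "ERR_URL_INACCESSIBLE"  # inaccessible URL
    DOCLING_FAILED = "ERR_DOCLING_FAILED"      # Docling command failure
    OUTPUT_MISSING = "ERR_OUTPUT_MISSING"      # missing or empty output file
    STORAGE_FAILED = "ERR_STORAGE_FAILED"      # failure storing the output


def error_record(code: ActorError, message: str) -> dict:
    # Assumed record shape for illustration; the real Actor may use other fields.
    return {"status": "error", "error": code.value, "message": message}


if __name__ == "__main__":
    print(error_record(ActorError.INVALID_INPUT, "documentUrl is required"))
```

Keeping the codes in one place makes it harder for a caller to emit a misspelled code, which is the main design point of the sketch.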
- Add `apify pushData` calls to log errors when the document URL is missing or inaccessible.
- Introduce dataset record creation with processing results, including a success status and output file URL.
- Modify completion message to indicate successful processing and provide a link to the results.
Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
Added detailed information about the Actor's output storage to the `README.md`. This includes specifying where processed documents, processing logs, and dataset records are stored. Signed-off-by: Václav Vančura <[email protected]>
This commit transitions the Actor from using the full Docling CLI package to the more lightweight docling-serve API. Key changes include:
- Redesign Dockerfile to use docling-serve as base image
- Update actor.sh to communicate with API instead of running CLI commands
- Improve content type handling for various output formats
- Update input schema to align with API parameters
- Reduce Docker image size from ~6GB to ~600MB
- Update documentation and changelog to reflect architectural changes

The image size reduction will make the Actor more cost-effective for users while maintaining all existing functionality including OCR capabilities.

Issue: No official docling-serve Docker image is currently available, which will be addressed in a future commit.
Signed-off-by: Václav Vančura <[email protected]>
This commit completely revamps the Actor implementation with two major improvements:

1) CRITICAL CHANGE: Switch to official docling-serve image
   * Now using quay.io/ds4sd/docling-serve-cpu:latest as base image
   * Eliminates need for custom docling installation
   * Ensures compatibility with latest docling-serve features
   * Provides more reliable and consistent document processing
2) Fix Apify Actor KVS storage issues:
   * Standardize key names to follow Apify conventions:
     - Change "OUTPUT_RESULT" to "OUTPUT"
     - Change "DOCLING_LOG" to "LOG"
   * Add proper multi-stage Docker build:
     - First stage builds dependencies including apify-cli
     - Second stage uses official image and adds only necessary tools
   * Fix permission issues in Docker container:
     - Set up proper user and directory permissions
     - Create writable directories for temporary files and models
     - Configure environment variables for proper execution
3) Solve EACCES permission errors during CLI version checks:
   * Create temporary HOME directory with proper write permissions
   * Set APIFY_DISABLE_VERSION_CHECK=1 environment variable
   * Add NODE_OPTIONS="--no-warnings" to suppress update checks
   * Support --no-update-notifier CLI flag when available
4) Improve code organization and reliability:
   * Create reusable upload_to_kvs() function for all KVS operations
   * Ensure log files are uploaded before tools directory is removed
   * Set proper MIME types based on output format
   * Add detailed error reporting and proper cleanup
   * Display final output URLs for easy verification

This major refactoring significantly improves reliability and maintainability by leveraging the official docling-serve image while solving persistent permission and storage issues. The Actor now properly follows Apify standards while providing a more robust document processing pipeline.
Signed-off-by: Václav Vančura <[email protected]>
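The "set proper MIME types based on output format" step from this commit could look roughly like the Python sketch below. It is not the Actor's code (the real `upload_to_kvs()` lives in a shell script); the format names in `MIME_TYPES` and the helper `content_type_for()` are assumptions chosen for illustration. Only the key names `OUTPUT` and `LOG` come from the commit message.

```python
# Key names standardized by the commit above.
OUTPUT_KEY = "OUTPUT"
LOG_KEY = "LOG"

# Assumed set of output formats; the PR text does not list them here.
MIME_TYPES = {
    "md": "text/markdown",
    "json": "application/json",
    "html": "text/html",
    "text": "text/plain",
}


def content_type_for(output_format: str) -> str:
    """Return a MIME type for the given format, with a generic fallback."""
    return MIME_TYPES.get(output_format.lower(), "application/octet-stream")


if __name__ == "__main__":
    for fmt in ("md", "JSON", "unknown"):
        print(fmt, "->", content_type_for(fmt))
```

Falling back to `application/octet-stream` keeps an unexpected format from failing the upload, at the cost of a less useful content type for the consumer.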
Refactor the `actor.sh` script to modularize functions for finding the Apify CLI, setting up a temporary environment, and cleaning it up. Introduce a new function, `get_actor_input()`, to handle input detection more robustly. Replace inline Python conversion logic with an external script, `docling_processor.py`, for processing documents via the docling-serve API. Signed-off-by: Václav Vančura <[email protected]>
Signed-off-by: Václav Vančura <[email protected]>
(Sorry, these noisy commits above are here to make DCO happy). |
Signed-off-by: Václav Vančura <[email protected]>
Thank you @vancura, I've validated it and it works magic. It's 10x to 40x more effective. @PeterStaar-IBM @dolfim-ibm, would you mind, please, indicating whether there are any outstanding issues that might be a blocker for the PR to be merged? I'm happy to help with anything. |
I see that in the current implementation you switched to calling docling-serve internally but still use the new custom input schema format that you defined. This is not really what we were suggesting in our comment. The request was: let's expose only one input schema to the user. Using docling-serve was a suggestion, in case you plan to expose it directly. If you have to wrap it anyway, then it might just introduce extra iterations.
Thank you for the quick info. I think that it might be quite challenging to make the HTTP/REST design pattern compatible with the Web Actor Programming Model. Just to make sure, you're asking to convert the Actor input schema ( So the intention is to make the
The output of the Actor is always going to be a list of links to files saved in the Actor object storage. @dolfim-ibm Please confirm this is what you want us to do, or explain where I misunderstood your requirements. @vancura If you have a better idea for the output compatibility, please suggest it.
Our point is to reduce the number of different input schemas that users have to deal with. Having the same API that we will promote in docling-serve should also increase adoption of Apify. Also note that in the payload posted above, 95% of the arguments are optional. Apify could simply rely on their defaults to simplify it.
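One way to read the suggestion about relying on defaults is that the Actor accepts a sparse input and fills in the rest before calling the API. The Python sketch below shows that merge under stated assumptions: the field names (`http_sources`, `options`, `to_formats`, `do_ocr`) and the default values are illustrative placeholders, not a confirmed docling-serve schema.

```python
# Hypothetical defaults; a real deployment would take these from the API itself.
DEFAULT_OPTIONS = {"to_formats": ["md"], "do_ocr": True}


def build_payload(actor_input: dict) -> dict:
    """Merge user-supplied options over defaults; user values win."""
    if not actor_input.get("http_sources"):
        raise ValueError("at least one source URL is required")
    options = {**DEFAULT_OPTIONS, **actor_input.get("options", {})}
    return {"http_sources": actor_input["http_sources"], "options": options}


if __name__ == "__main__":
    minimal = {"http_sources": [{"url": "https://example.com/doc.pdf"}]}
    print(build_payload(minimal))
```

With this shape, a user who only supplies a document URL gets the default behavior, while anyone who needs a specific option can still override it.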
Dear Docling maintainers,
I have wrapped Docling as an Apify Actor by adding the Actor definition in the `.actor` directory and published the Docling Actor on Apify Store. I've also added the Actor status badge and a brief usage description to the README, including the “Run on Apify” button.
For the full description of the Actor, please see the README file in the `.actor` directory.
Docling can now be used in the cloud without installation, free of charge. Users can avoid managing Python, OCR libraries, and ML model dependencies locally. The Actor can be used either from Apify Console, API, or CLI locally:
The Actor processes documents and stores the results in Apify's key-value store under the `OUTPUT` key. It supports multiple output formats:
Technical implementation
The Actor provides:
The Actor uses the official `quay.io/ds4sd/docling-serve-cpu` Docker image (~4GB) with all necessary dependencies:
Note: The first Actor run may take 1-2 minutes to start as the container initializes. This is normal behavior, and users shouldn't terminate the run prematurely.
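As a rough illustration of how a client might locate the `OUTPUT` record once a run has finished, the Python sketch below builds the record URL for Apify's key-value store API. The helper name `output_record_url()` and the store ID in the example are hypothetical, and the sketch only constructs the URL without making a request.

```python
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2"


def output_record_url(store_id: str, key: str = "OUTPUT") -> str:
    """Build the URL of a key-value store record (no network call is made)."""
    return f"{API_BASE}/key-value-stores/{quote(store_id, safe='')}/records/{quote(key, safe='')}"


if __name__ == "__main__":
    # "abc123" is a placeholder store ID, not a real one.
    print(output_record_url("abc123"))
```

Fetching that URL with any HTTP client (plus an API token for private stores) would return the converted document in whichever format the run produced.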
Apify will sponsor your project
All the links to Apify in this PR are affiliate links under the Apify open source fair share program with id `docling` in the passive tier of the program. In the passive tier, Apify commits to sending a monthly commission via the GitHub Sponsor button from all new sign-ups that come through your link. The only action required on your part is to accept the pull request and ensure your GitHub Sponsor button is set up.
You can earn a larger commission and gain insights into traffic by registering directly with Apify, claiming ownership of the Actor on the Apify Store, and maintaining the Actor yourself. Simply contact support after signing up and pass the ownership challenge. The Actor will then be transferred, e.g., to `ds4sd/docling`, and you’ll see it under your Apify account.
To further increase your income from Apify, you can convert your Actor on Apify Store to the pay-per-event pricing model and join the active developer tier. We offer an individual competitive advantage for the active developer tier in the form of either a significantly reduced Apify margin or discounted compute unit pricing. Feel free to ask for it!
Benefits of the Actor Programming Model
The Web Actor Programming Model is a new concept for building serverless microapps, which are easy to develop, share, integrate, and build upon. Actors are a reincarnation of the UNIX philosophy for programs running in the cloud. Actors are web automation scripts that are easy to integrate and scale up. The main benefit is that even a small piece of software can be turned into a public cloud service in a heartbeat.
Apify is the largest ecosystem where developers build, deploy, and publish data extraction, web automation tools, and AI agents. With over 3,000 Actors on Apify Store and 10 years of experience in the market, Apify makes Docling accessible to over 250,000 developers using the platform monthly. This also enables integration with other Actors on Store, custom Actors, and platform integrations that can create much more powerful workflows than just individual parts.
Full disclosure
I work at Apify. Apify doesn’t sell your software, but we sell the computing resources needed to run your software in the cloud to the end users. Your project is one of the first we selected to pilot Apify's open source fair share program. Please let me know if there’s anything I can do to help you accept this PR! If you do, we’d be pleased to feature your project in our marketing communication.
If you have any questions or need assistance, don’t hesitate to reach out to me (@vancura) or @netmilk, the Apify VP of DX, or just write to us at [email protected].