How can I efficiently replace translated text in a PDF while preserving original layout, font styles, and tables using PyMuPDF? #4752
talha-ansarii started this conversation in General
Replies: 0 comments
I’m building an automated PDF translation tool in Python that:
extracts structured text, layout, and formatting information from a PDF using PyMuPDF,
translates the text blocks using an LLM (Google Gemini), and
reinserts the translated text into the PDF while preserving the original layout, fonts, bold/italic styles, tables, and images.
Here’s what I’ve tried so far:
I’m currently using page.get_text("dict") to extract blocks, and page.insert_htmlbox() to place the translated text at each bounding box.
This works well visually, but for large PDFs (50+ pages) the process becomes very slow, mainly because I’m calling insert_htmlbox() once for every block.
I also need to handle Indic and European languages, so text shaping and font fallback are important.
I’ve read discussion #4395, which mentions that insert_htmlbox() is expensive internally and that batching might help.
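To make the per-block approach concrete, here is a minimal sketch of the extraction half. It is illustrative rather than the actual pipeline code: the helper names (`span_to_html`, `block_to_html`, `text_blocks`) and the `sample` dictionary are hypothetical, and `sample` is a hand-written stand-in for what `page.get_text("dict")` returns, so the snippet runs without a PDF. It assumes the span `flags` bits documented by PyMuPDF (bit 1 for italic, bit 4 for bold).

```python
from html import escape

# Span flag bits, as documented for PyMuPDF text spans (assumed here).
ITALIC = 1 << 1  # 2
BOLD = 1 << 4    # 16


def span_to_html(span):
    """Wrap one span's text in <b>/<i> tags and carry over its font size."""
    text = escape(span["text"])
    if span["flags"] & ITALIC:
        text = f"<i>{text}</i>"
    if span["flags"] & BOLD:
        text = f"<b>{text}</b>"
    return f'<span style="font-size:{span["size"]:g}pt">{text}</span>'


def block_to_html(block):
    """Join the spans of a text block, line by line, into one HTML fragment."""
    lines = []
    for line in block.get("lines", []):
        lines.append("".join(span_to_html(s) for s in line["spans"]))
    return "<br>".join(lines)


def text_blocks(page_dict):
    """Yield (bbox, html) for each text block (type 0); image blocks are skipped."""
    for block in page_dict["blocks"]:
        if block.get("type") == 0:
            yield tuple(block["bbox"]), block_to_html(block)


# Hypothetical stand-in for page.get_text("dict") on a single page.
sample = {
    "blocks": [
        {
            "type": 0,
            "bbox": (72.0, 72.0, 300.0, 100.0),
            "lines": [
                {"spans": [
                    {"text": "Total ", "size": 11.0, "flags": 0},
                    {"text": "A & B", "size": 11.0, "flags": BOLD},
                ]},
            ],
        },
        {"type": 1, "bbox": (72.0, 120.0, 300.0, 300.0)},  # image block
    ]
}

result = list(text_blocks(sample))
```

In the real loop, each `(bbox, html)` pair would be translated and then passed to `page.insert_htmlbox(bbox, translated_html)`, one call per block, which is where the time goes on large documents.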
Could you please guide me on:
What’s the most efficient PyMuPDF approach to replace or overlay text while preserving layout and style?
Is there a better alternative to insert_htmlbox() for inserting shaped text (e.g., for Hindi, Tamil, etc.)?
Would it be practical to construct one HTML block per page instead of one per text block, or is there a more native way to handle this efficiently?
Is there a recommended strategy for font fallback or dynamic text scaling when translated text doesn’t fit within the original bounding box?
My end goal is a scalable pipeline that can translate multi-page PDFs (50–200 pages) while maintaining 95%+ visual fidelity to the original.
Thanks in advance for your time and for maintaining this amazing library.