How can I efficiently replace translated text in a PDF while preserving original layout, font styles, and tables using PyMuPDF? #4752

talha-ansarii · 2025-10-17T09:32:02Z

I’m building an automated PDF translation tool in Python that:

extracts structured text, layout, and formatting info from a PDF using PyMuPDF,

translates the text blocks using an LLM (Google Gemini), and

reinserts the translated text into the PDF while preserving the original layout, fonts, bold/italic styles, tables, and images.

Here’s what I’ve tried so far:

I’m currently using page.get_text("dict") to extract blocks, and page.insert_htmlbox() to place the translated text at each bounding box.

This works well visually, but for large PDFs (50+ pages) the process becomes very slow, mainly because I’m calling insert_htmlbox() once for every block.
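
For context, the per-block conversion described above might look roughly like the sketch below. It is only an illustration, not the exact code from my tool: the helper names span_to_html and block_to_html are hypothetical, and it assumes the span layout returned by page.get_text("dict"), where each span carries "text", "size", and a "flags" bit field (bit value 2 for italic, 16 for bold).

```python
from html import escape

# Span flag bits as found in the "flags" field of get_text("dict") spans.
FLAG_ITALIC = 2
FLAG_BOLD = 16


def span_to_html(span: dict) -> str:
    """Wrap one span's text in tags that mirror its bold/italic flags and size."""
    html = escape(span["text"])  # protect &, <, > in the (translated) text
    flags = span.get("flags", 0)
    if flags & FLAG_ITALIC:
        html = f"<i>{html}</i>"
    if flags & FLAG_BOLD:
        html = f"<b>{html}</b>"
    size = span.get("size")
    if size:
        html = f'<span style="font-size:{size:g}pt">{html}</span>'
    return html


def block_to_html(block: dict) -> str:
    """Join the spans of every line in a text block, with <br> between lines."""
    lines = []
    for line in block.get("lines", []):
        lines.append("".join(span_to_html(s) for s in line["spans"]))
    return "<br>".join(lines)
```

The resulting string could then be passed to page.insert_htmlbox(block["bbox"], html), which is the call that ends up being repeated once per block.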

I also need to handle Indic and European languages, so text shaping and font fallback are important.

I’ve read discussion #4395, which mentions that insert_htmlbox() is expensive internally and that batching might help.

Could you please guide me on:

What’s the most efficient PyMuPDF approach to replace or overlay text while preserving layout and style?

Is there a better alternative to insert_htmlbox() for inserting shaped text (e.g., for Hindi, Tamil, etc.)?

Would it be practical to construct one HTML block per page instead of one per text block, or is there a more native way to handle this efficiently?

Any recommended strategy for font fallback or dynamic text scaling when translated text doesn’t fit within the original bounding box?

My end goal is a scalable pipeline that can translate multi-page PDFs (50–200 pages) while maintaining 95%+ visual fidelity to the original.

Thanks in advance for your time and for maintaining this amazing library.

Replies: 0 comments
