Summary: PyMuPDF is a Python binding for the document renderer and toolkit MuPDF
12
12
Description:
13
-
Release date: March 26, 2021
13
+
Release date: April 10, 2021
14
14
15
15
Authors
16
16
=======
@@ -25,7 +25,7 @@ Description:
25
25
26
26
MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
27
27
28
-
With PyMuPDF you can access files with extensions like “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can also be opened and handled like documents.
28
+
With PyMuPDF you can access files with extensions like .pdf, .xps, .oxps, .cbz, .fb2 or .epub. In addition, about 10 popular image formats can also be handled like documents: .png, .bmp, .gif, .tiff, etc.
29
29
30
30
PyMuPDF should run on all platforms that are supported by both MuPDF and Python 3.6+. These include, but are not limited to, Windows, Mac OSX and Linux, 32-bit or 64-bit. If you can generate MuPDF on a Python supported platform, then PyMuPDF can be used there as well.
31
31
@@ -59,7 +59,7 @@ Description:
59
59
License and Copyright Information
60
60
==================================
61
61
62
-
In order to comply with MuPDF’s dual licensing model, PyMuPDF has entered into an agreement with Artifex who has the right to sublicense PyMuPDF to third parties.
62
+
In order to comply with MuPDF's dual licensing model, PyMuPDF has entered into an agreement with Artifex, which has the right to sublicense PyMuPDF to third parties.
63
63
64
64
PyMuPDF and MuPDF are now available under both open-source AGPL and commercial license agreements. Please read the full text of the AGPL license agreement, available in the distribution material (file COPYING) and `here <https://www.gnu.org/licenses/agpl-3.0.html>`_, to ensure that your use case complies with the guidelines of the license. If you determine you cannot meet the requirements of the AGPL, please contact `Artifex <https://artifex.com/contact/>`_ for more information regarding a commercial license.
@@ -19,9 +19,9 @@ PyMuPDF (current version 1.18.11) is a Python binding with support for [MuPDF](h
19
19
20
20
MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
21
21
22
-
With PyMuPDF you can access files with extensions like “.pdf”, “.xps”, “.oxps”, “.cbz”, “.fb2” or “.epub”. In addition, about 10 popular image formats can also be opened and handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc..
22
+
With PyMuPDF you can access files with extensions like ".pdf", ".xps", ".oxps", ".cbz", ".fb2" or ".epub". In addition, about 10 popular image formats can also be handled like documents: ".png", ".jpg", ".bmp", ".tiff", etc.
23
23
24
-
> In partnership with [Artifex](https://artifex.com/), PyMuPDF is now also available for commercial licensing. This agreement has no impact on use cases, that are compliant with the open-source license AGPL. Please see the “License and Copyright” section below for additional information.
24
+
> In partnership with [Artifex](https://artifex.com/), PyMuPDF is now also available for commercial licensing. This agreement has no impact on use cases that are compliant with the open-source license AGPL. Please see the "License and Copyright" section below for additional information.
25
25
26
26
# Usage and Documentation
27
27
For all supported document types (i.e. **_including images_**) you can
@@ -79,7 +79,7 @@ Before you can do that, you must first build MuPDF. For most platforms, the MuPD
79
79
- Now MuPDF can be generated.
80
80
81
81
* Please note that you will need the interface generator [SWIG](http://www.swig.org/) when building PyMuPDF from the sources of this repository (please refer to issue #312 for some background on this).
82
-
- PyMuPDF wheels are being generated using **SWIG v4.0.1**.
82
+
- PyMuPDF wheels are being generated using **SWIG v4.0.2**.
83
83
84
84
* If you do **not use SWIG**, please download the **sources from PyPI** - they contain sources pre-processed by SWIG, so installation should work like any other Python extension generation on your system.
otext = otext.replace(test, testn) # change the source
141
141
found_one = True
142
142
pos1 = 0 # start over
143
-
143
+
144
144
if found_one:
145
145
ofile = open(filename + ".html", "w")
146
146
ofile.write(otext)
@@ -217,7 +217,7 @@ XML
217
217
~~~
218
218
219
219
The :meth:`TextPage.extractXML` (or *Page.get_text("xml")*) version extracts text (no images) with the detail level of RAWDICT::
220
-
220
+
221
221
>>> for line in page.get_text("xml").splitlines():
222
222
print(line)
223
223
@@ -261,17 +261,19 @@ Text Extraction Flags Defaults
261
261
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
262
262
*(New in version 1.16.2)* Method :meth:`Page.get_text` supports a keyword parameter *flags* *(int)* to control the amount and the quality of extracted data. The following table shows the default settings (flags parameter omitted or None) for each extraction variant. If you specify flags with a value other than *None*, be aware that you must set **all desired** options. A description of the respective bit settings can be found in :ref:`TextPreserve`.
* **"json"** is handled exactly like **"dict"** and is hence left out.
276
+
* **"rawjson"** is handled exactly like **"rawdict"** and is hence left out.
275
277
* An "n/a" specification means a value of 0 and setting this bit never has any effect on the output (but an adverse effect on performance).
276
278
* If you are not interested in images when using an output variant which includes them by default, then by all means set the respective bit off: You will experience a better performance and much lower space requirements.
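A minimal sketch of how such a *flags* value is composed. The four constants below are assumptions that mirror the bit values behind PyMuPDF's ``TEXT_PRESERVE_LIGATURES``, ``TEXT_PRESERVE_WHITESPACE``, ``TEXT_PRESERVE_IMAGES`` and ``TEXT_INHIBIT_SPACES``, and the "dict" default used here is likewise assumed; :ref:`TextPreserve` and the defaults table remain the authoritative source. In real code, use the ``fitz.TEXT_*`` constants instead of redefining them.

```python
# Assumed bit values, redefined here only so the sketch runs without PyMuPDF.
TEXT_PRESERVE_LIGATURES = 1
TEXT_PRESERVE_WHITESPACE = 2
TEXT_PRESERVE_IMAGES = 4
TEXT_INHIBIT_SPACES = 8

# Assumed default for the "dict" variant: ligatures, whitespace and images on.
dict_default = (
    TEXT_PRESERVE_LIGATURES | TEXT_PRESERVE_WHITESPACE | TEXT_PRESERVE_IMAGES
)

# Switch the image bit off; every other desired bit must stay set explicitly.
flags_no_images = dict_default & ~TEXT_PRESERVE_IMAGES

# Adding one option means OR-ing it onto the full set, not passing it alone.
flags_inhibit = dict_default | TEXT_INHIBIT_SPACES


def is_set(flags, bit):
    """Return True if the given bit is switched on in flags."""
    return bool(flags & bit)


print(flags_no_images, is_set(flags_no_images, TEXT_PRESERVE_IMAGES))
print(flags_inhibit, is_set(flags_inhibit, TEXT_PRESERVE_WHITESPACE))
```

The resulting integer would then be passed as the *flags* keyword, for example ``page.get_text("dict", flags=flags_no_images)``.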
277
279
@@ -291,7 +293,7 @@ To show the effect of *TEXT_INHIBIT_SPACES* have a look at this example::
docs/app4.rst (54 additions, 0 deletions)
@@ -3,6 +3,60 @@
3
3
================================================
4
4
Appendix 4: Assorted Technical Information
5
5
================================================
6
+
This section deals with various technical topics that are not necessarily related to each other.
7
+
8
+
------------
9
+
10
+
.. _ImageTransformation:
11
+
12
+
Image Transformation Matrix
13
+
----------------------------
14
+
Starting with version 1.18.11, the image transformation matrix is returned by some methods for text and image extraction: :meth:`Page.get_text` and :meth:`Page.get_image_bbox`.
15
+
16
+
The transformation matrix contains information about how an image was transformed to fit into the rectangle (its "boundary box" = "bbox") on some document page. By inspecting the image's bbox on the page and this matrix, one can determine, for example, whether and how the image is displayed scaled or rotated on a page.
17
+
18
+
The relationship between image width and height and the bbox on a page is the following:
19
+
20
+
1. Using the original image's width and height, we can define the image rectangle ``imgrect = fitz.Rect(0, 0, width, height)`` and a "shrink matrix" ``shrink = fitz.Matrix(1/width, 0, 0, 1/height, 0, 0)``.
21
+
2. Transforming the image rectangle with its shrink matrix will result in the unit rectangle: ``imgrect * shrink = fitz.Rect(0, 0, 1, 1)``.
22
+
3. Using the image **transformation matrix** "transform", the following steps will compute the bbox::
4. Inspecting the matrix product ``shrink * transform`` will reveal all information about what happened to the image rectangle to make it fit into the bbox on the page: rotation, scaling of its sides and translation of its origin. Let us look at an example:
29
+
30
+
>>> imginfo = page.get_images()[0] # get an image item on a page
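The arithmetic behind the steps above can be checked without PyMuPDF. The following is a plain-Python sketch under stated assumptions: matrices are 6-tuples ``(a, b, c, d, e, f)`` in the same order as ``fitz.Matrix``, rectangles are ``(x0, y0, x1, y1)``, the helper names ``transform_point``, ``transform_rect`` and ``concat`` are hypothetical, and the image size and ``transform`` values are made up for illustration rather than taken from a real page.

```python
def transform_point(p, m):
    """Apply matrix m to point p (row-vector convention, as in PDF)."""
    x, y = p
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_rect(r, m):
    """Transform all four corners and return their bounding rectangle."""
    x0, y0, x1, y1 = r
    corners = [
        transform_point(p, m) for p in ((x0, y0), (x1, y0), (x0, y1), (x1, y1))
    ]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def concat(m1, m2):
    """Matrix product m1 * m2: apply m1 first, then m2."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


width, height = 512, 256                     # hypothetical image size
imgrect = (0, 0, width, height)
shrink = (1 / width, 0, 0, 1 / height, 0, 0)
transform = (128, 0, 0, 64, 50, 60)          # hypothetical image matrix

unit = transform_rect(imgrect, shrink)       # step 2: the unit rectangle
bbox = transform_rect(unit, transform)       # step 3: the bbox on the page
combined = concat(shrink, transform)         # step 4: what happened overall

print(unit)      # (0.0, 0.0, 1.0, 1.0)
print(bbox)      # (50.0, 60.0, 178.0, 124.0)
print(combined)  # (0.25, 0.0, 0.0, 0.25, 50.0, 60.0)
```

In this made-up case the combined matrix has equal scale factors of 0.25, zero off-diagonal entries and a translation of (50, 60), so the image would be shown at a quarter of its size, unrotated, with its origin moved to that point.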
docs/changes.rst (11 additions, 0 deletions)
@@ -1,6 +1,17 @@
1
1
Change Logs
2
2
===============
3
3
4
+
Changes in Version 1.18.11
5
+
---------------------------
6
+
* **Fixed** issue `#972 <https://github.com/pymupdf/PyMuPDF/issues/972>`_. Improved layout of source distribution material.
7
+
* **Fixed** issue `#962 <https://github.com/pymupdf/PyMuPDF/issues/962>`_. Stabilized Linux distribution detection for generating PyMuPDF from sources.
8
+
* **Added:** :meth:`Page.get_xobjects` delivers the result of :meth:`Document.get_page_xobjects`.
9
+
* **Added:** :meth:`Page.get_image_info` delivers meta information for all images shown on the page.
10
+
* **Added:** :meth:`Tools.mupdf_display_warnings` allows switching the display of MuPDF-generated warnings on or off. The default is off.
11
+
* **Added:** :meth:`Document.ez_save` convenience alias of :meth:`Document.save` with some different defaults.
12
+
* **Changed:** Image extractions of document pages now also contain the image's **transformation matrix**. This concerns :meth:`Page.get_image_bbox` and the DICT, JSON, RAWDICT, and RAWJSON variants of :meth:`Page.get_text`.
13
+
14
+
4
15
Changes in Version 1.18.10
5
16
---------------------------
6
17
* **Fixed** issue `#941 <https://github.com/pymupdf/PyMuPDF/issues/941>`_. Added old aliases for :meth:`DisplayList.get_pixmap` and :meth:`DisplayList.get_textpage`.
docs/document.rst (28 additions, 10 deletions)
@@ -43,6 +43,7 @@ For details on **embedded files** refer to Appendix 3.
43
43
:meth:`Document.embfile_info` PDF only: metadata of an embedded file
44
44
:meth:`Document.embfile_names` PDF only: list of embedded files
45
45
:meth:`Document.embfile_upd` PDF only: change an embedded file
46
+
:meth:`Document.ez_save` PDF only: :meth:`Document.save` with different defaults
46
47
:meth:`Document.find_bookmark` retrieve page location after layouting
47
48
:meth:`Document.fullcopy_page` PDF only: duplicate a page
48
49
:meth:`Document.get_oc_states` PDF only: lists of OCGs in ON, OFF, RBGroups
@@ -706,7 +707,7 @@ For details on **embedded files** refer to Appendix 3.
706
707
707
708
PDF only: Return the PDF dictionary keys of the object provided by its xref number.
708
709
709
-
:arg int xref: the :data:`xref`. *(Changed in v1.18.10)* Use ``-1`` if you want to access the special dictionary "PDF trailer" (it has no identifying xref).
710
+
:arg int xref: the :data:`xref`. *(Changed in v1.18.10)* Use ``-1`` to access the special dictionary "PDF trailer" (it has no identifying xref).
710
711
711
712
:returns: a tuple of dictionary keys present in object :data:`xref`. Examples:
712
713
@@ -727,7 +728,7 @@ For details on **embedded files** refer to Appendix 3.
727
728
728
729
PDF only: Return type and value of a PDF dictionary key of an xref.
729
730
730
-
:arg int xref: the :data:`xref`. *(Changed in v1.18.10)* Use ``-1`` if you want to access the special dictionary "PDF trailer" (it has no identifying xref).
731
+
:arg int xref: the :data:`xref`. *Changed in v1.18.10:* Use ``-1`` to access the special dictionary "PDF trailer" (it has no identifying xref).
731
732
:arg str key: the desired PDF key. Must **exactly** match (case-sensitive) one of the keys contained in :meth:`Document.xref_get_keys`.
732
733
733
734
:returns: a tuple (type, value), where type is one of "xref", "array", "dict", "int", "float", "null", "bool", "name", "string" or "unknown" (should not occur). Independent of "type", the value of the key is **always** formatted as a string -- see the following example -- and a faithful reflection of what is stored in the PDF. An argument like the return value can be used to modify the value of a key of :data:`xref`.
@@ -739,7 +740,7 @@ For details on **embedded files** refer to Appendix 3.
739
740
Resources = ('xref', '1296 0 R')
740
741
MediaBox = ('array', '[0 0 612 792]')
741
742
Parent = ('xref', '1301 0 R')
742
-
>>> #no the same thing for the PDF trailer:
743
+
>>> # same thing for the PDF trailer:
743
744
>>> for key in doc.xref_get_keys(-1):
744
745
print(key, "=", doc.xref_get_key(-1, key))
745
746
Type = ('name', '/XRef')
@@ -790,17 +791,19 @@ For details on **embedded files** refer to Appendix 3.
790
791
791
792
.. method:: get_page_xobjects(pno)
792
793
794
+
*(Changed in v1.18.11)*
795
+
793
796
PDF only: *(New in v1.16.13)* Return a list of all XObjects referenced by a page.
:returns: a list of (non-image) XObjects. These objects typically represent pages *embedded* (not copied) from other PDFs. For example, :meth:`Page.show_pdf_page` will create this type of object. An item of this list has the following layout: **(xref, name, invoker, bbox)**, where
801
+
:returns: a list of (non-image) XObjects. These objects typically represent pages *embedded* (not copied) from other PDFs. For example, :meth:`Page.show_pdf_page` will create this type of object. An item of this list has the following layout: ``(xref, name, invoker, bbox)``, where
799
802
800
-
* **xref** (*int*) is the XObject's :data:`xref`
801
-
* **name** (*str*) is the symbolic name to reference the XObject
802
-
* **invoker** (*int*) the :data:`xref` of the invoking XObject or zero if the page directly invokes it
803
-
* **bbox** (*tuple*) the boundary box of the XObject's location on the page **in untransformed coordinates**. To get actual, non-rotated page coordinates, multiply with the page's transformation matrix :attr:`Page.transformation_matrix`.
803
+
* **xref** (*int*) is the XObject's :data:`xref`.
804
+
* **name** (*str*) is the symbolic name to reference the XObject.
805
+
* **invoker** (*int*) is the :data:`xref` of the invoking XObject, or zero if the page directly invokes it.
806
+
* **bbox** (:ref:`Rect`) the boundary box of the XObject's location on the page **in untransformed coordinates**. To get actual, non-rotated page coordinates, multiply with the page's transformation matrix :attr:`Page.transformation_matrix`. *Changed in v1.18.11:* the bbox is now formatted as :ref:`Rect`.
804
807
805
808
806
809
.. method:: get_page_images(pno, full=False)
@@ -1095,11 +1098,19 @@ For details on **embedded files** refer to Appendix 3.
1095
1098
1096
1099
:arg str user_pw: *(new in version 1.16.0)* set the document's user password.
1097
1100
1101
+
.. method:: ez_save(*args, **kwargs)
1102
+
1103
+
*(New in v1.18.11)*
1104
+
1105
+
PDF only: The same as :meth:`Document.save` but with the changed defaults ``deflate=True, garbage=3``.
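The relationship between the two methods can be pictured as a thin wrapper that only changes keyword defaults. The class below is a hypothetical stand-in, not PyMuPDF's implementation: it records the options it receives instead of writing a file, and the ``save`` defaults shown (``garbage=0``, ``deflate=False``) are assumptions for illustration.

```python
class SketchDocument:
    """Hypothetical stand-in for a PDF document; nothing is written to disk."""

    def __init__(self):
        self.last_save = None

    def save(self, filename, garbage=0, deflate=False, **kwargs):
        # Record the effective options so the difference in defaults is visible.
        self.last_save = dict(
            filename=filename, garbage=garbage, deflate=deflate, **kwargs
        )

    def ez_save(self, filename, garbage=3, deflate=True, **kwargs):
        # Same call as save(), only the defaults differ.
        return self.save(filename, garbage=garbage, deflate=deflate, **kwargs)


doc = SketchDocument()

doc.save("out.pdf")
plain = doc.last_save

doc.ez_save("out.pdf")
easy = doc.last_save

doc.ez_save("out.pdf", garbage=4)   # explicit arguments still win
override = doc.last_save

print(plain)
print(easy)
print(override)
```

With the real library, the equivalent pair of calls would be ``doc.ez_save("out.pdf")`` and ``doc.save("out.pdf", deflate=True, garbage=3)``.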
1106
+
1098
1107
.. method:: saveIncr()
1099
1108
1100
1109
PDF only: saves the document incrementally. This is a convenience abbreviation for *doc.save(doc.name, incremental=True, encryption=PDF_ENCRYPT_KEEP)*.
PDF only: Return the definition source of a PDF object.
1403
1414
1415
+
:arg int xref: the object's :data:`xref`. *Changed in v1.18.10:* A value of -1 returns the PDF trailer source.
1416
+
:arg bool compressed: whether to generate a compact output with no line breaks or spaces.
1417
+
:arg bool ascii: whether to ASCII-encode binary data.
1418
+
1419
+
:rtype: str
1420
+
:returns: The object definition source.
1421
+
1404
1422
.. method:: pdf_catalog()
1405
1423
1406
1424
*(New in version 1.16.8)*
@@ -1412,7 +1430,7 @@ For details on **embedded files** refer to Appendix 3.
1412
1430
1413
1431
*(New in version 1.16.8)*
1414
1432
1415
-
PDF only: Return the trailer source of the PDF (UTF-8), which is usually located at the PDF file's end. This is similar to :meth:`Document.xref_object` except that this object has no identifier to access it.
1433
+
PDF only: Return the trailer source of the PDF, which is usually located at the PDF file's end. This is :meth:`Document.xref_object` with an *xref* argument of -1.