pdf2msgpack is designed to efficiently dump information from a PDF in a portable format which is easy to consume without heavyweight PDF libraries.
At the moment it dumps glyphs (a rectangle and text) and path information.
pdf2msgpack is used internally at The Sensible Code company. It's still quite young and the output might change significantly.
To configure and build, run:
./waf configure
./waf build
There is an accompanying docker file which enables building a static binary.
To build the static binary, just ensure submodules are checked out and
invoke make.
git submodule update --init
make
./pdf2msgpack <pdf-file>
pdf2msgpack writes its output to msgpack, which is a convenient, fast serialization format with libraries available in many languages.
Here is a description of the wire format.
At the top level the document is not a list but consecutive objects written to the stream.
document (consecutive objects): <wire version : int> <metadata> <page>...
metadata (dict): {
"Pages": int,
"FileName": str,
"XFA": dict,
"EmbeddedFiles": [<embedfile>...],
"FontInfo": [<fontinfo : dict>...],
(other string fields supplied in PDF),
}
page (dict): {
"Size": [<width : int>, <height : int>],
"Glyphs": [<glyph>...],
"Paths": [<pathitem>...],
"Bitmap": [ [<width : int>, <height : int>], <pixeldata : uint8> ]
}
embedfile (list):
[ <name : str>, <description : str>, <mime-type : str>,
<created-date : str>, <modified-date : str>, <content : bin>]
pathitem (list): [<type : (EO_FILL|STROKE|FILL|SET_STROKE_COLOR|
SET_STROKE_WIDTH|SET_FILL_COLOR)>,
[<pathdata>...]]
pathdata (list): depending on pathitem type
EO_FILL, FILL, STROKE: pathcoord
SET_FILL_COLOR: [<r : uint8>, <g : uint8>, <b : uint8>]
SET_STROKE_WIDTH: <width : float>
pathcoord (2 list or 6 list):
point: [<x : float>, <y : float>]
quadratic curve: [<a : float>, <b : float>, <c : float>, <d : float>, <e : float>, <f : float>]
fontinfo (dict): {<Name> <Type> <Encoding> <Embedded> <Subset> <ToUnicode>}
Some of the options affect the wire format generated by the command.
--meta-only: changesdocumentto skip generation ofpageobjects.--bitmap: generate grayscale bitmap inpageobject (otherwise absent)--xfa: generate XML Forms Architecture (XFA) data inmetadataobject (otherwise nil)--embedded-files: generate embedded files inmetadataobject (otherwise nil)
pdf2msgpack is licensed under the GPLv2.