Commit 50c1def

man: document cutmarks briefly
1 parent 812e485 commit 50c1def

1 file changed, +78 -0 lines changed

doc/casync.rst

+78
@@ -157,6 +157,8 @@ General options:
 --store=PATH  The primary chunk store to use
 --extra-store=<PATH>  Additional chunk store to look for chunks in
 --chunk-size=<[MIN:]AVG[:MAX]>  The minimal/average/maximum number of bytes in a chunk
+--cutmark=CUTMARK  Specify a cutmark
+--cutmark-delta-bytes=BYTES  Maximum bytes to shift cut due to cutmark
 --digest=<DIGEST>  Pick digest algorithm (sha512-256 or sha256)
 --compression=<COMPRESSION>  Pick compression algorithm (zstd, xz or gzip)
 --seed=<PATH>  Additional file or directory to use as seed
@@ -291,3 +293,79 @@ excluded:
 unconditionally take precedence over lines not marked like this. Moreover,
 lines prefixed with ``!`` also cancel the effect of patterns in
 ``.caexclude`` files placed in directories further up the tree.
+
+Cutmarks
+--------
+
+``casync`` cuts the stream to serialize into chunks of an average size (as
+specified with ``--chunk-size=``), determining cut points using the ``buzhash``
+rolling hash function and a modulo test. Frequently, cut points determined that
+way are at slightly inconvenient locations: in the middle of objects serialized
+in the stream rather than before or after them, thus needlessly exploding
+changes to individual objects into more than one chunk. To optimize this,
+**cutmarks** may be configured. These are byte sequences (up to 8 bytes in
+length) that ``casync`` automatically detects in the data stream and that
+should be considered particularly good cut points. When cutmarks are defined,
+the chunking algorithm will slightly move the cut point between two chunks to
+match a cutmark if one has recently been seen in the serialization stream.
312+
Cutmarks may be specified with the ``--cutmark=`` option. It takes a cutmark
313+
specification in the format ``VALUE:MASK+OFFSET`` or ``VALUE:MASK-OFFSET``. The
314+
first part, the value indicates the byte sequence to detect in hexadecimal
315+
digits, up to 8 bytes (thus 16 characters) in length. Following the colon a
316+
bitmask (also in hexadecimal) may be specified of the same size. Every 8 byte
317+
sequence at every 1 byte granularity stream position is tested against the
318+
value. If all bits indicated in the mask match a cutmark is found. The third
319+
part of the specification indicates where to place the cutmark specifically
320+
relative to the the end of the 8 byte sequence. Specify ``-8`` to cut
321+
immediately before the cutmark sequence, and ``+0`` right after. The offset
322+
(along with its ``+`` or ``-`` character) may be omitted, in which case the
323+
offset is assumed to be zero, i.e. the cut is done right after the
324+
sequence. The mask (along with its ``:`` character) may also be omitted, in
325+
which case it is assumed to be ``FFFFFFFFFFFFFFFF``, i.e. all
326+
bits on, matching the full specified byte sequence. In order to match shorter
327+
byte sequence (for example to adapt the tool to some specific file format using
328+
shorter object or section markers) simply specificy a shorter mask value and
329+
correct the offset value.
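
To make the ``VALUE:MASK+OFFSET`` syntax concrete, here is a minimal parsing sketch in Python. It is not taken from ``casync``: the names ``Cutmark`` and ``parse_cutmark`` are hypothetical, the offset is assumed to be a decimal number, and the special values ``yes`` and ``no`` are deliberately not handled.

```python
import re
from typing import NamedTuple


class Cutmark(NamedTuple):
    value: int   # byte sequence to detect, parsed from hexadecimal
    mask: int    # bits of the value that must match
    offset: int  # signed cut position relative to the end of the sequence


# VALUE, then optional ":MASK", then optional "+OFFSET" or "-OFFSET".
# Assumption: the offset is written in decimal.
_SPEC = re.compile(
    r"([0-9A-Fa-f]{1,16})(?::([0-9A-Fa-f]{1,16}))?(?:([+-])([0-9]+))?"
)


def parse_cutmark(spec):
    """Parse a cutmark specification string into a Cutmark tuple."""
    m = _SPEC.fullmatch(spec)
    if m is None:
        raise ValueError(f"not a cutmark specification: {spec!r}")
    value_hex, mask_hex, sign, digits = m.groups()
    value = int(value_hex, 16)
    # An omitted mask means all 64 bits are compared.
    mask = int(mask_hex, 16) if mask_hex else 0xFFFFFFFFFFFFFFFF
    # An omitted offset means zero, i.e. cut right after the sequence.
    offset = int(digits) if digits else 0
    if sign == "-":
        offset = -offset
    return Cutmark(value, mask, offset)
```

For example, ``parse_cutmark("C0FFEE:FFFFFF-5")`` yields the value ``0xC0FFEE``, the mask ``0xFFFFFF`` and the offset ``-5``. How a value shorter than 8 bytes is aligned within the 8 byte comparison window is not modelled here.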
+
+Examples:
+
+    --cutmark=123456789ABCDEF0
+
+
+This defines a cutmark to be the 8 byte sequence 0x12, 0x34, 0x56, 0x78, 0x9A,
+0xBC, 0xDE, 0xF0, and the cut is placed right after the last byte, i.e. after the
+0xF0.
+
+
+    --cutmark=C0FFEE:FFFFFF-5
+
+
+This defines a cutmark to be the 3 byte sequence 0xC0, 0xFF, 0xEE and the cut is
+placed right after the last byte, i.e. after the 0xEE.
+
+    --cutmark=C0DECAFE:FFFFFFFF-8
+
+
+This defines a cutmark to be the 4 byte sequence 0xC0, 0xDE, 0xCA, 0xFE and the
+cut is placed right before the first byte, i.e. before the 0xC0.
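
For a full 8 byte cutmark, the matching rule and the ``+0`` and ``-8`` offsets can be sketched as follows. This is an illustration under stated assumptions rather than ``casync``'s code: ``find_cutmark_cuts`` is a hypothetical name, the first hexadecimal byte of the value is assumed to be the earliest byte in the stream, and values shorter than 8 bytes (as in the second and third examples) are not modelled.

```python
def find_cutmark_cuts(data, value, mask=0xFFFFFFFFFFFFFFFF, offset=0):
    """Return candidate cut positions for one 8-byte cutmark.

    Every 8-byte window, at every byte position, is compared against
    value under mask. The cut position is the end of the matching
    window plus offset.
    """
    hits = []
    window = 0  # the most recent 8 bytes, oldest byte most significant
    for i, b in enumerate(data):
        window = ((window << 8) | b) & 0xFFFFFFFFFFFFFFFF
        if i >= 7 and (window & mask) == (value & mask):
            cut = i + 1 + offset
            if 0 <= cut <= len(data):
                hits.append(cut)
    return hits


# The sequence from the first example, embedded at offset 4 of a buffer.
stream = b"AAAA" + bytes.fromhex("123456789ABCDEF0") + b"BBBB"
after = find_cutmark_cuts(stream, 0x123456789ABCDEF0)              # offset +0
before = find_cutmark_cuts(stream, 0x123456789ABCDEF0, offset=-8)  # offset -8
```

Here ``after`` holds the position right behind the 0xF0 byte and ``before`` the position right in front of the 0x12 byte.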
+
+When operating on the file system layer (i.e. when creating ``.caidx`` files),
+the implicit cutmark of ``--cutmark=51bb5beabcfa9613+8`` is used, to increase
+the chance that cuts are placed right before each serialized file.
+
+Multiple cutmarks may be defined for the same operation: simply specify
+``--cutmark=`` multiple times. The parameter also takes the special values
+``yes`` and ``no``. If the latter is given, any implicit cutmarks are turned
+off, in particular the one described above for generating ``.caidx`` files.
+
+``casync`` will honour cutmarks only within the immediate vicinity of the cut
+point the modulo test suggested. By default this is a 16K window before the
+calculated cut point. This value may be altered using the
+``--cutmark-delta-max=`` setting.
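
The windowing rule can be sketched as a small helper that moves a suggested cut back to the most recent cutmark position, if one lies close enough. The function and parameter names are hypothetical, and preferring the most recent candidate is an assumption of this sketch; only the 16K default window before the calculated cut point comes from the text above.

```python
import bisect


def adjust_cut(suggested, cutmark_cuts, chunk_start=0, delta_max=16 * 1024):
    """Move a suggested cut point back to a nearby cutmark, if any.

    cutmark_cuts must be a sorted list of candidate cut positions.
    Only candidates inside the current chunk and at most delta_max
    bytes before the suggested cut point are honoured.
    """
    i = bisect.bisect_right(cutmark_cuts, suggested)
    if i == 0:
        return suggested  # no cutmark at or before the suggested point
    candidate = cutmark_cuts[i - 1]  # most recent cutmark position
    if candidate > chunk_start and suggested - candidate <= delta_max:
        return candidate
    return suggested
```

With the default window, a suggested cut at 100000 moves to a cutmark position at 90000, but stays put if the most recent cutmark position is at 50000.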
+
+Any configured cutmark (and the selected ``--cutmark-delta-max=`` value) is
+also stored in the ``.caidx`` or ``.caibx`` file to ensure that such an index
+file contains sufficient data for an extracting client to properly use an
+existing file system tree (or block device) as seed while applying the same
+chunking logic as the original image.
