@@ -157,6 +157,8 @@ General options:
  --store=PATH                   The primary chunk store to use
  --extra-store=<PATH>           Additional chunk store to look for chunks in
  --chunk-size=<[MIN:]AVG[:MAX]> The minimal/average/maximum number of bytes in a chunk
+ --cutmark=CUTMARK              Specify a cutmark
+ --cutmark-delta-bytes=BYTES    Maximum bytes to shift cut due to cutmark
  --digest=<DIGEST>              Pick digest algorithm (sha512-256 or sha256)
  --compression=<COMPRESSION>    Pick compression algorithm (zstd, xz or gzip)
  --seed=<PATH>                  Additional file or directory to use as seed
@@ -291,3 +293,79 @@ excluded:
  unconditionally take precedence over lines not marked like this. Moreover,
  lines prefixed with ``!`` also cancel the effect of patterns in
  ``.caexclude`` files placed in directories further up the tree.
+
+ Cutmarks
+ --------
+
+ ``casync`` cuts the stream to serialize into chunks of an average size (as
+ specified with ``--chunk-size=``), determining cut points using the ``buzhash``
+ rolling hash function and a modulo test. Frequently, cut points determined that
+ way are at slightly inconvenient locations: in the middle of objects serialized
+ in the stream rather than before or after them, thus needlessly spreading
+ changes to individual objects across more than one chunk. To optimize this,
+ **cutmarks** may be configured. These are byte sequences (up to 8 bytes in
+ length) that ``casync`` automatically detects in the data stream and that
+ should be considered particularly good cut points. When cutmarks are defined,
+ the chunking algorithm slightly moves the cut point between two chunks to
+ match a cutmark if one has recently been seen in the serialization stream.
+
+ Cutmarks may be specified with the ``--cutmark=`` option. It takes a cutmark
+ specification in the format ``VALUE:MASK+OFFSET`` or ``VALUE:MASK-OFFSET``. The
+ first part, the value, indicates the byte sequence to detect in hexadecimal
+ digits, up to 8 bytes (thus 16 characters) in length. Following the colon, a
+ bitmask of the same size (also in hexadecimal) may be specified. Every 8 byte
+ sequence in the stream, at 1 byte granularity, is tested against the value. If
+ all bits indicated in the mask match, a cutmark is found. The third part of
+ the specification indicates where to place the cut relative to the end of the
+ 8 byte sequence. Specify ``-8`` to cut immediately before the cutmark
+ sequence, and ``+0`` to cut right after it. The offset (along with its ``+``
+ or ``-`` character) may be omitted, in which case the offset is assumed to be
+ zero, i.e. the cut is done right after the sequence. The mask (along with its
+ ``:`` character) may also be omitted, in which case it is assumed to be
+ ``FFFFFFFFFFFFFFFF``, i.e. all bits on, matching the full specified byte
+ sequence. In order to match shorter byte sequences (for example to adapt the
+ tool to some specific file format using shorter object or section markers),
+ simply specify a shorter mask value and correct the offset value.
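The specification format described above can be illustrated with a small Python sketch. This is not the ``casync`` implementation: the names ``Cutmark``, ``parse_cutmark`` and ``find_cuts`` are hypothetical, the special values ``yes`` and ``no`` are not handled, and the sketch assumes that stream bytes are shifted into the 8 byte window in big-endian order (newest byte in the lowest 8 bits) and that the offset is counted from the end of that window.

```python
import re
from typing import Iterator, NamedTuple

FULL_MASK = 0xFFFF_FFFF_FFFF_FFFF  # default mask: all 64 bits on


class Cutmark(NamedTuple):
    value: int   # byte sequence to detect, as a 64-bit number
    mask: int    # bits of the 8 byte window that must match
    offset: int  # cut position relative to the end of the window


# VALUE[:MASK][+OFFSET|-OFFSET]; VALUE and MASK are 1 to 16 hex digits.
_SPEC = re.compile(r"([0-9A-Fa-f]{1,16})(?::([0-9A-Fa-f]{1,16}))?([+-][0-9]+)?")


def parse_cutmark(spec: str) -> Cutmark:
    """Parse a VALUE:MASK+OFFSET string; mask and offset are optional."""
    m = _SPEC.fullmatch(spec)
    if m is None:
        raise ValueError(f"invalid cutmark specification: {spec!r}")
    value = int(m.group(1), 16)
    mask = int(m.group(2), 16) if m.group(2) else FULL_MASK
    offset = int(m.group(3)) if m.group(3) else 0
    return Cutmark(value, mask, offset)


def find_cuts(data: bytes, mark: Cutmark) -> Iterator[int]:
    """Yield the cut offset implied by each cutmark match in data."""
    window = 0
    for i, byte in enumerate(data):
        # Assumption: big-endian accumulation, newest byte in the low bits.
        window = ((window << 8) | byte) & FULL_MASK
        # Only test once a full 8 byte window has been read.
        if i >= 7 and (window & mark.mask) == (mark.value & mark.mask):
            yield i + 1 + mark.offset
```

For instance, with ``data = b"ab" + bytes.fromhex("123456789ABCDEF0") + b"cd"``, the specification ``123456789ABCDEF0`` yields a cut right after the 0xF0 byte, and ``123456789ABCDEF0-8`` yields a cut right before the 0x12 byte.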
330
+
331
+ Examples:
332
+
333
+ --cutmark=123456789ABCDEF0
334
+
335
+
336
+ This defines a cutmark to be the 8 byte sequence 0x12, 0x34, 0x56, 0x78, 0x9A,
337
+ 0xBC, 0xDE, 0xF0, and the cut is placed right after the last byte, i.e. after the
338
+ 0xF0.
339
+
340
+
341
+ --cutmark=C0FFEE:FFFFFF-5
342
+
343
+
344
+ This defines a cutmark to be the 3 byte sequence 0xC0, 0xFF, 0xEE and the cut is
345
+ placed right after the last byte, i.e. after the 0xEE.
346
+
347
+ --cutmark=C0DECAFE:FFFFFFFF-8
348
+
349
+
350
+ This defines a cutmark to be the 4 byte sequence 0xC0, 0xDE, 0xCA, 0xFE and the
351
+ cut is placed right before the first byte, i.e. before the 0xC0.
352
+
353
+ When operating on the file system layer (i.e. when creating ``.caidx`` files),
+ the implicit cutmark of ``--cutmark=51bb5beabcfa9613+8`` is used, to increase
+ the chance that cuts are placed right before each serialized file.
+
+ Multiple cutmarks may be defined for the same operation; simply specify
+ ``--cutmark=`` multiple times. The parameter also takes the special values
+ ``yes`` and ``no``. If the latter is specified, any implicit cutmarks are turned
+ off, in particular the implicit cutmark used when generating ``.caidx`` files
+ described above.
+
+ ``casync`` will honour cutmarks only within the immediate vicinity of the cut
+ point the modulo test suggested. By default this is a 16K window before the
+ calculated cut point. This value may be altered using the
+ ``--cutmark-delta-max=`` setting.
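The window rule above could be sketched as follows. This is only an illustration under stated assumptions, not the actual chunker code: ``adjust_cut`` and ``DEFAULT_DELTA_MAX`` are hypothetical names, positions are byte offsets into the stream, the window bounds are treated as inclusive, and when several cutmarks fall into the window the sketch picks the one closest to the suggested cut point.

```python
from typing import Iterable

DEFAULT_DELTA_MAX = 16 * 1024  # the 16K default window described above


def adjust_cut(suggested: int, cutmark_cuts: Iterable[int],
               delta_max: int = DEFAULT_DELTA_MAX) -> int:
    """Return the cut position to use for a cut suggested by the modulo test."""
    lowest = suggested - delta_max
    # Assumption: only cutmarks in [suggested - delta_max, suggested] count,
    # and the one closest to the suggested cut point (the latest one) wins.
    candidates = [c for c in cutmark_cuts if lowest <= c <= suggested]
    # Without a cutmark in the window, the suggested cut is kept unchanged.
    return max(candidates, default=suggested)
```

With a suggested cut at offset 100000 and cutmark positions 10, 90000 and 99000, the sketch moves the cut to 99000. If the only cutmark is at offset 10, it lies outside the window and the cut stays at 100000.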
+
+ Any configured cutmark (and the selected ``--cutmark-delta-max=`` value) is
+ also stored in the ``.caidx`` or ``.caibx`` file to ensure that such an index
+ file contains sufficient data for an extracting client to properly use an
+ existing file system tree (or block device) as seed while applying the same
+ chunking logic as the original image.