
9.1.0 #595

Open
seperman wants to merge 12 commits into master from dev

Conversation


@seperman (Member) commented May 4, 2026

  • Multiprocessing for big nested objects

seperman added 12 commits April 27, 2026 11:27
  1. …sult with the inner namedtuple, dropping the outer container.
     Fixed by updating the namedtuple in its actual parent when nested, while preserving root-level
     namedtuple behavior.
  2. Tuple deltas using iterable opcodes could silently do nothing for insert/delete-only changes
     (see the sketch below).
     Fixed by writing the transformed tuple back instead of reconstructing the original tuple.
  3. Applying a delta with both moved and added iterable items could mutate the delta’s own
     internal diff data.
     Fixed by copying the added-items mapping before inserting temporary move placeholders.
  4. Removing multiple dictionary items with complex keys could crash during path sorting.
     Fixed by correcting the None check and falling back to string comparison when same-type path
     elements are still not orderable.

  Regression tests were added for each case, and the full Delta test suite passes.
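
  The item-2 tuple case is easy to picture with DeepDiff's public Delta API. A minimal sketch of
  the insert-only round-trip that used to no-op (whether a given diff takes the iterable-opcodes
  path internally is not shown here):

  ```python
  from deepdiff import DeepDiff, Delta

  t1 = (1, 2, 3)
  t2 = (1, 2, 3, 4)                # insert-only change
  delta = Delta(DeepDiff(t1, t2))
  # Before the fix, applying the delta could silently return t1 unchanged;
  # the transformed tuple is now written back.
  assert t1 + delta == t2
  ```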
…aises immediately instead of going through _raise_or_log(). Also added full-path preflight
  validation in _get_elements_and_details() so the set_item_added path introduced in the last
  commit cannot silently skip malicious dunder paths.
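
  A minimal sketch of the kind of dunder-path preflight check described above; the helper names
  and policy here are illustrative, not deepdiff's actual validation:

  ```python
  def _is_dunder(element):
      return isinstance(element, str) and element.startswith('__') and element.endswith('__')

  def validate_path_elements(elements):
      # Reject the whole path up front so no prefix of it is ever applied.
      for element in elements:
          if _is_dunder(element):
              raise ValueError(f'refusing to traverse dunder attribute: {element}')

  validate_path_elements(['root', 'items', 0])     # fine
  # validate_path_elements(['root', '__class__'])  # raises ValueError
  ```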
  Changed:

  - Replaced the homegrown linked-list LFU implementation in deepdiff/lfucache.py with a small
    DistanceCache wrapper over native cachebox.LRUCache.
  - Kept LFUCache = DistanceCache and DummyLFU compatibility names so internal imports keep
    working.
  - Updated deepdiff/diff.py cache hot paths to avoid contains + get double lookups.
  - Moved cachebox>=5.2,<6 into core dependencies in pyproject.toml, since DeepDiff now imports it
    unconditionally.
  - Updated tests/test_lfucache.py to validate the new bounded distance-cache behavior instead of
    LFU frequency internals.

  Benchmark results from the same 1,000,000-operation local microbenchmark:

  - Old homegrown LFUCache: 1.901302s
  - Direct cachebox.LFUCache: 5.846142s
  - Direct cachebox.LRUCache: 0.537102s
  - New DistanceCache wrapper: 1.153068s

  So I used cachebox.LRUCache, not cachebox.LFUCache, because cachebox’s LFU policy is slower for
  this workload.
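
  The wrapper shape, as a minimal sketch assuming cachebox's mapping-style interface; the real
  DistanceCache lives in deepdiff/lfucache.py:

  ```python
  from cachebox import LRUCache

  class DistanceCache:
      def __init__(self, maxsize):
          self._cache = LRUCache(maxsize)

      def set(self, key, value):
          self._cache[key] = value

      def get(self, key, default=None):
          # One lookup instead of a `key in cache` check followed by a get,
          # matching the diff.py hot-path change above.
          return self._cache.get(key, default)

  LFUCache = DistanceCache  # compatibility alias kept for internal imports
  ```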
  - helper.py: relaxed add_to_frozen_set to Any (callers use both int and str ids); changed
  type_in_type_group/type_is_subclass_of_type_group to accept Iterable[Type]
  - delta.py: added elem is not None guard, narrowed tag type, type-ignored namedtuple
  _replace/summarize
  - diff.py: typed _compare_in_order index params as Optional[int] with early return; fixed real
  bug len(other.indexes > 1) → len(other.indexes) > 1; cast UUID arg to str
  - distance.py: handled iterable_compare_func None check; widened max_/replace_inf_with to float;
  switched memoryview-incompatible strings to str
  - path.py: fixed real bug obj.append(_guess_type(...), next_element) (misplaced paren); coerced
  setattr name to str
  - serialization.py: type-ignored namedtuple _fields access
  Code:
  - deepdiff/_multiprocessing.py (new) — MPConfig, normalize_mp_config, picklability check,
  _distance_worker (module-level for spawn), compute_distances_parallel with stable job-index
  ordering.
  - deepdiff/diff.py — three new opt-in params, normalized into self._mp_config, propagated via
  _parameters. New _maybe_compute_pair_distances_parallel helper. One extra dict lookup in
  _get_most_in_common_pairs_in_iterables before the existing serial
  _get_rough_distance_of_hashed_objs call.

  Tests: tests/test_multiprocessing.py (23 tests) — config validation, 10× serial-vs-parallel
  determinism on nested dicts/repeated items/ties/sets/exclude_paths/ignore_string_case/custom
  hasher, unpickleable-callback fallback, no-nested-pool guarantee. Full suite: 1149 passed, 10
  skipped, 0 regressions. Pyright clean.

  Doc: docs/multi_processing.md now opens with an "Implementation Status" section listing what's
  in, the code locations, and what's deferred (subtickets #2/#4/#5/#6 extended matrix/#7) with the
  reasons each is held back.

  Two notable design points worth flagging:
  1. Workers are spawned without _shared_parameters, so they think they're root and would purge
  _distance_cache/hashes mid-call. Fixed by passing cache_purge_level=0 to the worker DeepDiff
  (commented in _distance_worker).
  2. Sanitization sets both multiprocessing=False and _mp_config=MPConfig(enabled=False, ...)
  because recursive DeepDiff with _parameters=... skips the constructor's normalization branch.
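
  Putting the pieces above together, a minimal sketch of the worker/orchestrator shape: a
  module-level worker (picklable under the spawn start method), job-index-keyed results, and the
  cache_purge_level=0 fix from design point 1. The bodies are illustrative, not the actual
  deepdiff code:

  ```python
  from concurrent.futures import ProcessPoolExecutor
  from deepdiff import DeepDiff

  def _distance_worker(job):
      # Module-level so 'spawn' can pickle it. cache_purge_level=0 mirrors
      # design point 1: a worker must not purge caches as if it were root.
      index, (t1, t2) = job
      diff = DeepDiff(t1, t2, ignore_order=True, get_deep_distance=True,
                      cache_purge_level=0)
      return index, diff.get('deep_distance', 0.0)

  def compute_distances_parallel(pairs, max_workers=2):
      jobs = list(enumerate(pairs))        # stable job-index ordering
      results = [0.0] * len(jobs)
      with ProcessPoolExecutor(max_workers=max_workers) as pool:
          for index, distance in pool.map(_distance_worker, jobs):
              results[index] = distance    # deterministic output order
      return results
  ```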
…eria are met:

  - ✅ Parallel _create_hashtable lands behind the existing multiprocessing=True opt-in
  - ✅ Serial and parallel results match for large lists of dicts, lists of lists, sets, repeated
  items, nested mixed structures
  - ✅ Both report_repetition=False and report_repetition=True covered
  - ✅ Result order matches serial output (verified via 10× repeat-comparison)
  - ✅ Pickling fallback (unpickleable hasher) tested end-to-end
  - ✅ Full suite green (1160 passed, 10 skipped); pyright clean
…e_diffs_parallel — workers compute a fresh DeepDiff per pair and ship back
  [(report_type, leaf), ...].
  - deepdiff/diff.py::_diff_iterable_with_deephash: paired _diff(change_level, ...) calls in both
  report_repetition branches are deferred into a queue and dispatched at the end via
  _dispatch_subtree_jobs. Inline serial behavior unchanged when mp is off.
  - deepdiff/diff.py: three new helpers — _subtree_parallel_safe (gates against custom_operators /
  *_obj_callback* / ignore_order_func), _rebase_subtree_leaf (splices the worker's leaf chain onto
  a fresh copy of change_level and clears path caches), _dispatch_subtree_jobs
  (parallel-or-serial-in-job-order, plus parent-side _skip_this re-filter for exclude_paths).
  - deepdiff/helper.py: NotPresent / Unprocessed / Skipped / NotHashed got __reduce__ so the
  singleton sentinels survive pickle round-trips (see the sketch after this list). Without this,
  the identity check "change.t2 is not notpresent" (used by TextResult._from_tree_default)
  silently flips for any DiffLevel that travels through a worker.
  - 9 new tests in tests/test_multiprocessing.py covering paired-subtree determinism, multiple
  changes per pair, dict add/remove, type changes, report_repetition=True, exclude_paths re-filter,
  custom_operators/exclude_obj_callback fallback, and direct unit tests.
  - docs/multi_processing.md: updated Implementation Status, Code locations, and partial Subticket
  #4 deferred items (_diff_dict shared keys, ordered-pair path, _iterable_opcodes propagation).
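
  The sentinel fix, as a minimal sketch — NotPresentSingleton and not_present are illustrative
  stand-ins for the helper.py classes:

  ```python
  import pickle

  class NotPresentSingleton:
      _instance = None

      def __new__(cls):
          if cls._instance is None:
              cls._instance = super().__new__(cls)
          return cls._instance

      def __reduce__(self):
          # Unpickling calls NotPresentSingleton(), which hands back the
          # process-local singleton instead of allocating a new object,
          # so `x is not_present` identity checks keep working.
          return (NotPresentSingleton, ())

  not_present = NotPresentSingleton()
  assert pickle.loads(pickle.dumps(not_present)) is not_present
  ```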
…t drift; one run is mathematically sufficient, two is cheap insurance).
  - Dropped 13 redundant determinism cases — kept one per behavior (tied distances, repetition,
  exclude_paths, subtree rebasing, subtree add/remove keys, no recursive spawn, threshold gating).
  - Marked the 10 spawn-heavy tests @pytest.mark.slow so they only run under --runslow.
  - Kept all the helper/config unit tests in the fast path — they test the same fallback logic
  without paying spawn cost.
  Code (deepdiff/_multiprocessing.py)
  - New helpers _extract_worker_stats and _aggregate_worker_stats.
  - _distance_worker and _subtree_diff_worker now return a stats delta as a third tuple element.
  - compute_distances_parallel and compute_subtree_diffs_parallel now return (result,
  aggregated_stats) instead of bare result.

  Code (deepdiff/diff.py)
  - New stats keys WORKER_DIFF_COUNT, WORKER_PASSES_COUNT, WORKER_DISTANCE_CACHE_HIT_COUNT,
  WORKER_BATCH_COUNT added to _stats init.
  - New helper _merge_worker_stats (sums counters, OR-merges limit flags; sketched below).
  - _maybe_compute_pair_distances_parallel and _dispatch_subtree_jobs unpack the new orchestrator
  return shape and merge.
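
  A minimal sketch of the merge rule (sum counters, OR-merge limit flags); the key names here are
  illustrative, not deepdiff's exact stats keys:

  ```python
  def merge_worker_stats(parent_stats, worker_stats):
      for key, value in worker_stats.items():
          if isinstance(value, bool):
              # Limit flags: once any worker hit a limit, the parent did too.
              parent_stats[key] = parent_stats.get(key, False) or value
          else:
              # Counters: simple sums across workers.
              parent_stats[key] = parent_stats.get(key, 0) + value
      return parent_stats

  stats = {'DIFF COUNT': 10, 'MAX PASS LIMIT REACHED': False}
  merge_worker_stats(stats, {'DIFF COUNT': 3, 'MAX PASS LIMIT REACHED': True})
  assert stats == {'DIFF COUNT': 13, 'MAX PASS LIMIT REACHED': True}
  ```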

  Tests
  - New classes TestWorkerStatsUnit, TestStatsKeys, TestWorkerStatsAggregationSlow (8 tests).
  - Updated TestSubtreeParallelHelper.test_empty_jobs_returns_empty_list for new return shape.
  - Updated expected_stats dicts in tests/test_cache.py (3 tests) and tests/test_ignore_order.py
  (2 tests) with the four new zeroed keys.
  - Full suite: 1148 pass, 35 multiprocessing pass with --runslow.

  Doc (docs/multi_processing.md)
  - Phase 4 implementation status, code locations, test summary, and Subticket #5 removed from
  "Not yet implemented".
  Tests
  - Added 21 tests to tests/test_multiprocessing.py across 5 new classes covering:
  report_repetition=False, sets/frozensets, pickleable custom hasher, ignore_string_case /
  ignore_numeric_type_changes / ignore_string_type_changes, include_paths, exclude_regex_paths,
  namedtuple/__slots__/__dict__ objects, group_by, generators, numpy (importorskip), pydantic
  (importorskip), verbose_level=2, to_dict() equality, closure iterable_compare_func, and worker
  exception propagation via an __reduce__ that survives pickle.dumps but raises on unpickle.
  - All 56 tests in test_multiprocessing.py pass; full suite (1126 tests + 10 skips) still green.
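
  The exception-propagation trick above is worth spelling out, since it is easy to get backwards.
  A minimal sketch — class and function names are illustrative:

  ```python
  import pickle

  def _explode():
      raise RuntimeError('boom on unpickle')

  class ExplodesOnUnpickle:
      def __reduce__(self):
          # dumps() succeeds because _explode is picklable by reference;
          # loads() calls _explode() and raises inside the worker process.
          return (_explode, ())

  blob = pickle.dumps(ExplodesOnUnpickle())   # succeeds
  try:
      pickle.loads(blob)
  except RuntimeError as exc:
      assert str(exc) == 'boom on unpickle'
  ```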

  Phase 6 — Subticket #7 (benchmarks)
  - Added benchmarks/multiprocessing_bench.py — three ignore_order=True workloads (paired_subtree,
  distance_loop, large_nested_dicts), --workers/--scale/--quick/--only flags, asserts parallel ==
  serial on every row, non-zero exit on divergence.
  - Verified locally: paired_subtree at scale=400 gets ~1.3× with 2 workers; quick scales show
  spawn overhead dominating (which is exactly why DEFAULT_THRESHOLD = 64 exists).
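
  The gating itself is simple; a minimal sketch, where should_parallelize is a hypothetical
  helper (only DEFAULT_THRESHOLD = 64 comes from the code):

  ```python
  DEFAULT_THRESHOLD = 64

  def should_parallelize(n_jobs, enabled, threshold=DEFAULT_THRESHOLD):
      # Stay serial for small batches so process-spawn overhead can never
      # make the parallel path slower than the inline one.
      return enabled and n_jobs >= threshold
  ```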

  Doc: docs/multi_processing.md updated with Phase 5 and Phase 6 status sections, code locations,
  and a tightened "Not yet implemented" entry that now only flags the _prep_iterable/_prep_dict
  deeper recursion, the _diff_dict/ordered-pair extension of #4, and threshold tuning.
