seperman commented on May 4, 2026
- Multi processing for big nested objects
…sult with the inner
namedtuple, dropping the outer container.
Fixed by updating the namedtuple in its actual parent when nested, while preserving root-level
namedtuple behavior.
2. Tuple deltas using iterable opcodes could silently do nothing for insert/delete-only changes.
Fixed by writing the transformed tuple back instead of reconstructing the original tuple.
3. Applying a delta with both moved and added iterable items could mutate the delta’s own
internal diff data.
Fixed by copying the added-items mapping before inserting temporary move placeholders.
4. Removing multiple dictionary items with complex keys could crash during path sorting.
Fixed by correcting the None check and falling back to string comparison when same-type path
elements are still not orderable.
Regression tests were added for each case, and the full Delta test suite passes.
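The ordering fallback in fix 4 can be sketched as a comparator (the names and exact rules here are hypothetical; the real helper lives in the Delta path-sorting code):

```python
from functools import cmp_to_key

def _compare_path_elements(a, b):
    # Hypothetical sketch of the fix-4 rule: treat None as the smallest
    # element, try native ordering first, and fall back to string
    # comparison when the elements are not mutually orderable.
    if a is None or b is None:
        return (a is not None) - (b is not None)
    try:
        if a < b:
            return -1
        if b < a:
            return 1
        return 0
    except TypeError:
        sa, sb = str(a), str(b)
        return (sa > sb) - (sa < sb)

# Mixed "complex" path elements no longer crash the sort:
elements = [(3, "x"), None, "root", 7]
ordered = sorted(elements, key=cmp_to_key(_compare_path_elements))
```

The key point is that the TypeError branch only triggers for pairs that Python itself refuses to order, so homogeneous keys keep their natural order.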
…aises immediately instead of going through _raise_or_log(). Also added full-path preflight validation in _get_elements_and_details() so the set_item_added path introduced in the last commit cannot silently skip malicious dunder paths.
Changed:
- Replaced the homegrown linked-list LFU implementation in deepdiff/lfucache.py with a small
DistanceCache wrapper over native cachebox.LRUCache.
- Kept LFUCache = DistanceCache and DummyLFU compatibility names so internal imports keep
working.
- Updated deepdiff/diff.py cache hot paths to avoid contains + get double lookups.
- Moved cachebox>=5.2,<6 into core dependencies in pyproject.toml, since DeepDiff now imports it
unconditionally.
- Updated tests/test_lfucache.py to validate the new bounded distance-cache behavior instead of
LFU frequency internals.
Benchmark result from the same 1,000,000 operation local microbenchmark:
- Old homegrown LFUCache: 1.901302s
- Direct cachebox.LFUCache: 5.846142s
- Direct cachebox.LRUCache: 0.537102s
- New DistanceCache wrapper: 1.153068s
So I used cachebox.LRUCache, not cachebox.LFUCache, because cachebox’s LFU policy is slower for
this workload.
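The "contains + get double lookup" change in diff.py amounts to replacing a membership test followed by a get with a single sentinel-based lookup; a minimal sketch, with a plain dict standing in for the cachebox-backed DistanceCache:

```python
# Sentinel instead of None, since None could be a legitimate cached value.
_NOT_FOUND = object()

cache = {}

def get_distance(key, compute):
    value = cache.get(key, _NOT_FOUND)
    if value is _NOT_FOUND:       # one lookup decides hit vs miss
        value = compute(key)
        cache[key] = value
    return value

d1 = get_distance("a!b", lambda k: 0.25)   # miss: computes and stores
d2 = get_distance("a!b", lambda k: 9.99)   # hit: compute is not called
```

On a hot path hit, this does one hash-and-probe instead of two, which matters when the cache is consulted millions of times as in the benchmark above.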
- helper.py: relaxed add_to_frozen_set to Any (callers use both int and str ids); changed type_in_type_group/type_is_subclass_of_type_group to accept Iterable[Type].
- delta.py: added elem is not None guard, narrowed tag type, type-ignored namedtuple _replace/summarize.
- diff.py: typed _compare_in_order index params as Optional[int] with early return; fixed real bug len(other.indexes > 1) → len(other.indexes) > 1; cast UUID arg to str.
- distance.py: handled iterable_compare_func None check; widened max_/replace_inf_with to float; switched memoryview-incompatible strings to str.
- path.py: fixed real bug obj.append(_guess_type(...), next_element) (misplaced paren); coerced setattr name to str.
- serialization.py: type-ignored namedtuple _fields access.
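Both "real bug" fixes in that list are paren-placement slips; the diff.py one can be reproduced in isolation:

```python
indexes = [10, 20]

# Buggy grouping: "indexes > 1" is evaluated first, and comparing a
# list to an int raises TypeError before len() ever runs.
try:
    len(indexes > 1)
    error = ""
except TypeError as exc:
    error = str(exc)

# Fixed grouping from the commit: take the length, then compare.
ok = len(indexes) > 1
```

Because the broken form raises rather than returning a wrong answer, it only bit on code paths where other.indexes actually had elements to compare.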
Code:
- deepdiff/_multiprocessing.py (new) — MPConfig, normalize_mp_config, picklability check, _distance_worker (module-level for spawn), compute_distances_parallel with stable job-index ordering.
- deepdiff/diff.py — three new opt-in params, normalized into self._mp_config, propagated via _parameters. New _maybe_compute_pair_distances_parallel helper. One extra dict lookup in _get_most_in_common_pairs_in_iterables before the existing serial _get_rough_distance_of_hashed_objs call.
Tests:
- tests/test_multiprocessing.py (23 tests) — config validation, 10× serial-vs-parallel determinism on nested dicts/repeated items/ties/sets/exclude_paths/ignore_string_case/custom hasher, unpickleable-callback fallback, no-nested-pool guarantee.
- Full suite: 1149 passed, 10 skipped, 0 regressions. Pyright clean.
Doc:
- docs/multi_processing.md now opens with an "Implementation Status" section listing what's in, the code locations, and what's deferred (subtickets #2/#4/#5/#6 extended matrix/#7) with the reasons each is held back.
Two notable design points worth flagging:
1. Workers are spawned without _shared_parameters, so they think they're root and would purge _distance_cache/hashes mid-call. Fixed by passing cache_purge_level=0 to the worker DeepDiff (commented in _distance_worker).
2. Sanitization sets both multiprocessing=False and _mp_config=MPConfig(enabled=False, ...) because recursive DeepDiff with _parameters=... skips the constructor's normalization branch.
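The picklability check presumably boils down to a try/except around pickle.dumps, since spawn-based workers receive their arguments via pickle; a minimal sketch (the helper name is hypothetical):

```python
import pickle

def _is_picklable(obj):
    # Hypothetical sketch: anything that fails pickle.dumps cannot cross
    # a spawn boundary, so the caller must fall back to the serial path.
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

def plain_hasher(x):
    # Module-level functions pickle by reference, so they pass the check.
    return hash(x)

unpicklable = lambda x: hash(x)  # lambdas cannot be pickled

use_parallel_plain = _is_picklable(plain_hasher)    # True
use_parallel_lambda = _is_picklable(unpicklable)    # False -> serial fallback
```

This is also why _distance_worker has to live at module level: the spawn start method imports the module in the child and looks the worker up by name rather than pickling its code.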
…eria are met: - ✅ Parallel _create_hashtable lands behind the existing multiprocessing=True opt-in - ✅ Serial and parallel results match for large lists of dicts, lists of lists, sets, repeated items, nested mixed structures - ✅ Both report_repetition=False and report_repetition=True covered - ✅ Result order matches serial output (verified via 10× repeat-comparison) - ✅ Pickling fallback (unpickleable hasher) tested end-to-end - ✅ Full suite green (1160 passed, 10 skipped); pyright clean
…e_diffs_parallel — workers compute a fresh DeepDiff per pair and ship back [(report_type, leaf), ...].
- deepdiff/diff.py::_diff_iterable_with_deephash: paired _diff(change_level, ...) calls in both report_repetition branches are deferred into a queue and dispatched at the end via _dispatch_subtree_jobs. Inline serial behavior unchanged when mp is off.
- deepdiff/diff.py: three new helpers — _subtree_parallel_safe (gates against custom_operators / *_obj_callback* / ignore_order_func), _rebase_subtree_leaf (splices the worker's leaf chain onto a fresh copy of change_level and clears path caches), _dispatch_subtree_jobs (parallel-or-serial-in-job-order, plus parent-side _skip_this re-filter for exclude_paths).
- deepdiff/helper.py: NotPresent / Unprocessed / Skipped / NotHashed got __reduce__ so the singleton sentinels survive pickle round-trips. Without this, change.t2 is not notpresent (used by TextResult._from_tree_default) silently flips for any DiffLevel that travels through a worker.
- 9 new tests in tests/test_multiprocessing.py covering paired-subtree determinism, multiple changes per pair, dict add/remove, type changes, report_repetition=True, exclude_paths re-filter, custom_operators/exclude_obj_callback fallback, and direct unit tests.
- docs/multi_processing.md: updated Implementation Status, Code locations, and partial Subticket #4 deferred items (_diff_dict shared keys, ordered-pair path, _iterable_opcodes propagation).
…t drift; one run is mathematically sufficient, two is cheap insurance).
- Dropped 13 redundant determinism cases — kept one per behavior (tied distances, repetition, exclude_paths, subtree rebasing, subtree add/remove keys, no recursive spawn, threshold gating).
- Marked the 10 spawn-heavy tests @pytest.mark.slow so they only run under --runslow.
- Kept all the helper/config unit tests in the fast path — they test the same fallback logic without paying spawn cost.
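The --runslow gate is the standard pytest recipe (this is the pattern from pytest's own documentation; the project's actual conftest.py may differ):

```python
# conftest.py — gate @pytest.mark.slow tests behind a --runslow flag.
import pytest

def pytest_addoption(parser):
    parser.addoption(
        "--runslow", action="store_true", default=False,
        help="run tests marked @pytest.mark.slow",
    )

def pytest_configure(config):
    config.addinivalue_line("markers", "slow: mark test as slow to run")

def pytest_collection_modifyitems(config, items):
    if config.getoption("--runslow"):
        return  # --runslow given: do not skip slow tests
    skip_slow = pytest.mark.skip(reason="need --runslow option to run")
    for item in items:
        if "slow" in item.keywords:
            item.add_marker(skip_slow)
```

With this in place, the spawn-heavy multiprocessing tests are collected but skipped on a plain `pytest` run and executed under `pytest --runslow`.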
- New helpers _extract_worker_stats and _aggregate_worker_stats.
- _distance_worker and _subtree_diff_worker now return a stats delta as a third tuple element.
- compute_distances_parallel and compute_subtree_diffs_parallel now return (result, aggregated_stats) instead of bare result.
Code (deepdiff/diff.py)
- New stats keys WORKER_DIFF_COUNT, WORKER_PASSES_COUNT, WORKER_DISTANCE_CACHE_HIT_COUNT, WORKER_BATCH_COUNT added to _stats init.
- New helper _merge_worker_stats (sums counters, OR-merges limit flags).
- _maybe_compute_pair_distances_parallel and _dispatch_subtree_jobs unpack the new orchestrator return shape and merge.
Tests
- New classes TestWorkerStatsUnit, TestStatsKeys, TestWorkerStatsAggregationSlow (8 tests).
- Updated TestSubtreeParallelHelper.test_empty_jobs_returns_empty_list for the new return shape.
- Updated expected_stats dicts in tests/test_cache.py (3 tests) and tests/test_ignore_order.py (2 tests) with the four new zeroed keys.
- Full suite: 1148 pass, 35 multiprocessing pass with --runslow.
Doc (docs/multi_processing.md)
- Phase 4 implementation status, code locations, test summary, and Subticket #5 removed from "Not yet implemented".
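The merge rule in _merge_worker_stats (sum the counters, OR the limit flags) can be sketched as follows — the function body and the MAX_PASS_LIMIT_REACHED flag name are illustrative assumptions, not the shipped code:

```python
def _merge_worker_stats_sketch(parent_stats, worker_delta):
    # Integer counters are summed; boolean "limit reached" flags are
    # OR-merged, so a flag tripped in any worker stays tripped in the
    # parent. The bool check must come first: bool is a subclass of int.
    for key, value in worker_delta.items():
        if isinstance(value, bool):
            parent_stats[key] = parent_stats.get(key, False) or value
        else:
            parent_stats[key] = parent_stats.get(key, 0) + value
    return parent_stats

stats = {"WORKER_DIFF_COUNT": 0, "MAX_PASS_LIMIT_REACHED": False}
_merge_worker_stats_sketch(stats, {"WORKER_DIFF_COUNT": 3, "MAX_PASS_LIMIT_REACHED": False})
_merge_worker_stats_sketch(stats, {"WORKER_DIFF_COUNT": 2, "MAX_PASS_LIMIT_REACHED": True})
```

OR-merging rather than summing the flags is what makes the aggregated stats match what a serial run would have reported if any single pass hit a limit.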
- Added 21 tests to tests/test_multiprocessing.py across 5 new classes covering: report_repetition=False, sets/frozensets, pickleable custom hasher, ignore_string_case / ignore_numeric_type_changes / ignore_string_type_changes, include_paths, exclude_regex_paths, namedtuple/__slots__/__dict__ objects, group_by, generators, numpy (importorskip), pydantic (importorskip), verbose_level=2, to_dict() equality, closure iterable_compare_func, and worker exception propagation via a __reduce__ that survives pickle.dumps but raises on unpickle.
- All 56 tests in test_multiprocessing.py pass; the full suite (1126 tests + 10 skips) is still green.
Phase 6 — Subticket #7 (benchmarks)
- Added benchmarks/multiprocessing_bench.py — three ignore_order=True workloads (paired_subtree, distance_loop, large_nested_dicts), --workers/--scale/--quick/--only flags; asserts parallel == serial on every row, with a non-zero exit on divergence.
- Verified locally: paired_subtree at scale=400 gets ~1.3× with 2 workers; quick scales show spawn overhead dominating (which is exactly why DEFAULT_THRESHOLD = 64 exists).
Doc: docs/multi_processing.md updated with Phase 5 and Phase 6 status sections, code locations, and a tightened "Not yet implemented" entry that now only flags the _prep_iterable/_prep_dict deeper recursion, the _diff_dict/ordered-pair extension of #4, and threshold tuning.
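The exception-propagation trick mentioned above — an object whose __reduce__ survives pickle.dumps but raises on unpickle — can be sketched like this (names are hypothetical):

```python
import pickle

def _raise_on_unpickle():
    raise RuntimeError("boom: failure injected at unpickle time")

class ExplodesOnUnpickle:
    # __reduce__ hands pickle a callable to run at load time, so dumps()
    # succeeds but loads() raises — simulating a worker result that dies
    # on its way back to the parent process.
    def __reduce__(self):
        return (_raise_on_unpickle, ())

payload = pickle.dumps(ExplodesOnUnpickle())  # serialization succeeds
try:
    pickle.loads(payload)
    raised = False
except RuntimeError:
    raised = True
```

This lets a single-process test exercise the parent's error-handling path without actually spawning a worker that crashes mid-transfer.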