feat(python): chunk based map serialization for python #2038

Merged


chaokunyang
Collaborator

@chaokunyang chaokunyang commented Feb 5, 2025

What does this PR do?

Implement chunk-based map serialization for Python using Cython.
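
To illustrate the general idea behind chunking, here is a minimal pure-Python sketch. It is not Fury's wire format or the Cython code in this PR: the type tags, the 255-entry chunk limit, the fixed-width int encoding, and both function names are assumptions made for illustration. Consecutive entries that share the same key and value types are grouped into one chunk, so the type information is written once per chunk instead of once per key and once per value.

```python
import struct
from itertools import groupby

# Hypothetical one-byte type tags for this sketch; not Fury's real type IDs.
TYPE_TAGS = {int: 1, str: 2}
MAX_CHUNK = 255  # assumed limit so the chunk length fits in one byte


def _encode_value(value):
    """Encode an int as 8 bytes, or a str as a 4-byte length plus UTF-8 bytes."""
    if isinstance(value, int):
        return struct.pack("<q", value)
    data = value.encode("utf-8")
    return struct.pack("<I", len(data)) + data


def encode_map_per_entry(mapping):
    """Baseline: write a type tag before every key and every value."""
    out = bytearray(struct.pack("<I", len(mapping)))
    for k, v in mapping.items():
        out += bytes([TYPE_TAGS[type(k)]]) + _encode_value(k)
        out += bytes([TYPE_TAGS[type(v)]]) + _encode_value(v)
    return bytes(out)


def encode_map_chunked(mapping):
    """Chunked: one 3-byte header (key tag, value tag, count) per run of
    consecutive entries with the same key type and value type."""
    out = bytearray(struct.pack("<I", len(mapping)))
    same_types = lambda kv: (type(kv[0]), type(kv[1]))
    for (ktype, vtype), group in groupby(mapping.items(), key=same_types):
        group = list(group)
        for start in range(0, len(group), MAX_CHUNK):
            chunk = group[start:start + MAX_CHUNK]
            out += bytes([TYPE_TAGS[ktype], TYPE_TAGS[vtype], len(chunk)])
            for k, v in chunk:
                out += _encode_value(k) + _encode_value(v)
    return bytes(out)


if __name__ == "__main__":
    d = {i: i * 2 for i in range(50)}
    print("per-entry tags:", len(encode_map_per_entry(d)), "bytes")
    print("chunked:       ", len(encode_map_chunked(d)), "bytes")
```

In this toy encoding a homogeneous map pays for type tags once per chunk, which is where the size saving comes from. The real protocol differs in its details (for example in how integers are encoded).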

Note:

Related issues

Closes #1935

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
  • Does this PR introduce any binary protocol compatibility change?

Benchmark

Here is the benchmark script:

def test_map_benchmark(data=None, dictsize=50, repeat=100000):
    import timeit, pickle, pyfury

    fury = pyfury.Fury(language=pyfury.Language.PYTHON, ref_tracking=False)
    dict0 = data or {i: i * 2 for i in range(dictsize)}
    bytes0 = fury.serialize(dict0)
    bytes1 = pickle.dumps(dict0)
    print(f"fury serialize map of size {len(dict0)}, payload size {len(bytes0)}")
    print(f"pickle serialize map of size {len(dict0)}, payload size {len(bytes1)}")
    print(f"fury serialize map of size {len(dict0)}", timeit.timeit(lambda : fury.serialize(dict0), number=repeat))
    print(f"pickle serialize map of size {len(dict0)}", timeit.timeit(lambda : pickle.dumps(dict0), number=repeat))
    print(f"fury deserialize map of size {len(dict0)}", timeit.timeit(lambda : fury.deserialize(bytes0), number=repeat))
    print(f"pickle deserialize map of size {len(dict0)}", timeit.timeit(lambda : pickle.loads(bytes1), number=repeat))

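With this PR, the serialized payload is smaller than pickle's in every case below, ranging from about 58% of pickle's size (int map, 50 entries) to about 86% (str map, 1000 entries).
Test results: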

In [7]: test_map_benchmark(dictsize=50, repeat=1000000)
fury serialize map of size 50, payload size 126
pickle serialize map of size 50, payload size 216
fury serialize map of size 50 2.600001722999991
pickle serialize map of size 50 2.703825038000005
fury deserialize map of size 50 3.2978402969999934
pickle deserialize map of size 50 3.6489022370000015

In [8]: test_map_benchmark(dictsize=500, repeat=100000)
fury serialize map of size 500, payload size 1917
pickle serialize map of size 500, payload size 2632
fury serialize map of size 500 1.541724773999988
pickle serialize map of size 500 2.1854165999999964
fury deserialize map of size 500 3.2613812140000107
pickle deserialize map of size 500 3.2642077769999958

In [23]: test_map_benchmark(data={f"k{i}": f"v{i}" for i in range(10)}, repeat=1000000)
fury serialize map of size 10, payload size 88
pickle serialize map of size 10, payload size 116
fury serialize map of size 10 2.053245253
pickle serialize map of size 10 1.5431892400000606
fury deserialize map of size 10 2.8904618450000044
pickle deserialize map of size 10 2.2623522280000543

In [22]: test_map_benchmark(data={f"k{i}": f"v{i}" for i in range(1000)}, repeat=100000)
fury serialize map of size 1000, payload size 11801
pickle serialize map of size 1000, payload size 13798
fury serialize map of size 1000 7.018782786999964
pickle serialize map of size 1000 21.388066090000052
fury deserialize map of size 1000 19.090073496999935
pickle deserialize map of size 1000 20.72099072399999
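
As a quick check on the payload sizes reported above, the following snippet computes the Fury-to-pickle size ratio for each run. The byte counts are copied from the output above. The labels and the `size_ratios` helper are introduced here for illustration only.

```python
# (label, fury payload bytes, pickle payload bytes), copied from the runs above.
RESULTS_WITH_PR = [
    ("int map, 50 entries", 126, 216),
    ("int map, 500 entries", 1917, 2632),
    ("str map, 10 entries", 88, 116),
    ("str map, 1000 entries", 11801, 13798),
]


def size_ratios(rows):
    """Return {label: fury_size / pickle_size} for each benchmark row."""
    return {label: fury / pkl for label, fury, pkl in rows}


if __name__ == "__main__":
    for label, ratio in size_ratios(RESULTS_WITH_PR).items():
        print(f"{label}: fury payload is {ratio:.0%} of pickle's")
```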

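Before this PR, the serialized size for int maps was about 50% larger than pickle's (322 vs. 216 bytes and 3909 vs. 2632 bytes), while string maps were on par with or slightly smaller than pickle's: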

In [6]: test_map_benchmark(dictsize=50, repeat=1000000)
fury serialize map of size 50, payload size 322
pickle serialize map of size 50, payload size 216
fury serialize map of size 50 4.886074129999997
pickle serialize map of size 50 2.684925058999994
fury deserialize map of size 50 5.766612550999994
pickle deserialize map of size 50 3.482006009999992

In [7]: test_map_benchmark(dictsize=500, repeat=100000)
fury serialize map of size 500, payload size 3909
pickle serialize map of size 500, payload size 2632
fury serialize map of size 500 3.6878661510000086
pickle serialize map of size 500 2.0822324780000088
fury deserialize map of size 500 5.649835711999998
pickle deserialize map of size 500 3.401463585000016

In [8]: test_map_benchmark(data={f"k{i}": f"v{i}" for i in range(10)}, repeat=1000000)
fury serialize map of size 10, payload size 104
pickle serialize map of size 10, payload size 116
fury serialize map of size 10 1.5266061640000146
pickle serialize map of size 10 1.7377313819999927
fury deserialize map of size 10 2.830370420999998
pickle deserialize map of size 10 2.3650116949999926

In [9]: test_map_benchmark(data={f"k{i}": f"v{i}" for i in range(1000)}, repeat=100000)
fury serialize map of size 1000, payload size 13785
pickle serialize map of size 1000, payload size 13798
fury serialize map of size 1000 5.561600682000005
pickle serialize map of size 1000 15.757341811999993
fury deserialize map of size 1000 19.507720968
pickle deserialize map of size 1000 21.805054765999955
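
To compare the two sets of runs directly, this snippet computes the relative change in Fury's payload size and serialize time from before this PR to after it. All numbers are copied from the outputs above, with timings rounded to three decimals. The `relative_change` helper and the labels are illustrative names. The two sets came from separate sessions (pickle's own timings differ between them), so the timing changes should be read as rough indications only.

```python
# (label, before, after): Fury payload size in bytes.
FURY_SIZE = [
    ("int map, 50 entries", 322, 126),
    ("int map, 500 entries", 3909, 1917),
    ("str map, 10 entries", 104, 88),
    ("str map, 1000 entries", 13785, 11801),
]

# (label, before, after): total timeit seconds for fury.serialize.
FURY_SERIALIZE_SECONDS = [
    ("int map, 50 entries", 4.886, 2.600),
    ("int map, 500 entries", 3.688, 1.542),
    ("str map, 10 entries", 1.527, 2.053),
    ("str map, 1000 entries", 5.562, 7.019),
]


def relative_change(before, after):
    """Return (after - before) / before; negative means a reduction."""
    return (after - before) / before


if __name__ == "__main__":
    for title, rows in [("payload size", FURY_SIZE),
                        ("serialize time", FURY_SERIALIZE_SECONDS)]:
        print(title)
        for label, before, after in rows:
            print(f"  {label}: {relative_change(before, after):+.1%}")
```

By these figures, payload size drops in all four cases, most sharply for int maps. Serialize time improves for the int maps, while the string-map runs are slower in the second set of measurements.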

@pandalee99 pandalee99 self-requested a review February 6, 2025 07:12
Contributor

@pandalee99 pandalee99 left a comment

LGTM. This implementation is pretty cool!

@pandalee99 pandalee99 merged commit f2d38ab into apache:main Feb 6, 2025
43 checks passed
Development

Successfully merging this pull request may close these issues.

[Python][Protocol] Chunk by chunk predictive map serialization protocol
2 participants