This is more of a question.
I was trying to benchmark the JSON, Protobuf, and Avro data formats for an upcoming project.
I am using orjson for JSON serialization and deserialization, and pure-protobuf for Protobuf.
Here's what the code looks like:
Data classes:
from dataclasses import dataclass
from time import time  # needed for the timestamp in Head.fakeMe()

from faker import Faker
from pure_protobuf.dataclasses_ import field, message

Faker.seed(0)
fake = Faker()


@message
@dataclass
class Head:
    msgId: str = field(1)
    msgCode: str = field(2)
    guid: str = field(3)
    src: str = field(4)
    ts: int = field(5)

    @staticmethod
    def fakeMe():
        # Build a Head filled with fake values; ts is the current time in ms.
        return Head(
            fake.md5(),
            fake.pystr(min_chars=5, max_chars=5),
            fake.ean(length=13),
            fake.pystr(min_chars=1, max_chars=1),
            int(time() * 1000),
        )


@message
@dataclass
class Message:
    head: Head = field(1)
    # data: Data = field(2)
    status: bool = field(2)

    def fakeMe(self):
        self.head = Head.fakeMe()
        # self.data = Data.fakeMe()
        # Set the declared field "status" (the original assigned self.bool).
        self.status = fake.pybool()
        return self
Running Serialization and Deserialization:
import time, sys, orjson, message_pb2
from object_gen import create_dummy_obj
from dto.device_message import Message  # this is my data class


def measure_serialize_deserialize(obj, format):
    ser_fun = ser_obj.get(format)
    deser_fun = deser_obj.get(format)

    # serialize and measure time
    start_time = time.time()
    ser_data = ser_fun(obj)
    time_taken_ser = time.time() - start_time
    mem_ser = sys.getsizeof(ser_data)

    # deserialize and measure time
    start_time = time.time()
    deser_data = deser_fun(ser_data, Message)
    time_taken_deser = time.time() - start_time

    return (time_taken_ser, time_taken_deser, mem_ser)
def serialize_json(obj):
    return orjson.dumps(obj)


def deserialize_json(byteArr, klass):
    return orjson.loads(byteArr)


def serialize_proto(obj):
    return obj.dumps()


def deserialize_proto(byteArr, klass):
    return klass.loads(byteArr)


def serialize_avro(obj):
    pass


def deserialize_avro(byteArr, klass):
    pass


ser_obj = {
    "J": serialize_json,
    "P": serialize_proto,
    "A": serialize_avro,
}

deser_obj = {
    "J": deserialize_json,
    "P": deserialize_proto,
    "A": deserialize_avro,
}
def runBenchMarks(numberOfMsgs, format):
    ser_times = []
    deser_times = []
    memory_usage_plain = []
    memory_usage_ser = []

    for i in range(1, numberOfMsgs + 1):
        # create a new object based on the format
        obj = create_dummy_obj(format)
        memory_usage_plain.append(sys.getsizeof(obj))
        ser_time, deser_time, mem_ser = measure_serialize_deserialize(obj, format)
        ser_times.append(ser_time)
        deser_times.append(deser_time)
        memory_usage_ser.append(mem_ser)

    # return values
    return ser_times, deser_times, memory_usage_plain, memory_usage_ser
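A note on the timing approach: time.time() can have coarse resolution on some platforms, so a single very short call may register as 0. Below is a minimal sketch of an alternative, assuming time.perf_counter() and many repeats per measurement. The helper name time_per_call and the sample dict are hypothetical, and the standard-library json.dumps stands in for the real serializers; none of this is part of the benchmark above.

```python
import json
import time


def time_per_call(func, arg, repeats=1000):
    """Return the average wall-clock time of func(arg) in milliseconds."""
    func(arg)  # warm-up call so any one-time setup is not measured
    start = time.perf_counter()
    for _ in range(repeats):
        func(arg)
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1000.0


# Hypothetical sample payload, loosely shaped like the Message class.
sample = {"head": {"msgId": "abc", "ts": 1700000000000}, "status": True}

avg_ms = time_per_call(json.dumps, sample)
print(f"json.dumps: {avg_ms:.6f} ms per call")
```

Averaging over many calls to the same function should smooth out timer granularity and scheduling noise, which a single time.time() pair around one call cannot do.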
After running the program for 1000 messages using pure-protobuf, I found:
Running benchmark for 1000 samples and format = P
=========== Serialization METRICES (Time in ms) ====================
Total Time taken for serialization: 15.65241813659668
Avg Time taken for serialization: 0.01565241813659668
Min Time taken for serialization: 0.014781951904296875
Max Time taken for serialization: 0.04220008850097656
=========== Deserialization METRICES (Time in ms) ====================
Total Time taken for deserialization: 21.908044815063477
Avg Time taken for deserialization: 0.021908044815063477
Min Time taken for deserialization: 0.0209808349609375
Max Time taken for deserialization: 0.051975250244140625
=========== MEMORY METRICES (Bytes) ====================
Total memory utilized by Plain objects: 103000
Avg memory utilized by Plain objects: 103.0
Min memory utilized: 103
Max memory utilized: 103
Total memory utilized by serialized objects: 103000
Avg memory utilized by serialized objects: 103.0
Min memory utilized: 103
Max memory utilized: 103
Then I ran the same code for JSON:
Running benchmark for 1000 samples and format = J
=========== Serialization METRICES (Time in ms) ====================
Total Time taken for serialization: 0.9558200836181641
Avg Time taken for serialization: 0.0009558200836181642
Min Time taken for serialization: 0.0
Max Time taken for serialization: 0.20194053649902344
=========== Deserialization METRICES (Time in ms) ====================
Total Time taken for deserialization: 1.4314651489257812
Avg Time taken for deserialization: 0.0014314651489257812
Min Time taken for deserialization: 0.0007152557373046875
Max Time taken for deserialization: 0.029087066650390625
=========== MEMORY METRICES (Bytes) ====================
Total memory utilized by Plain objects: 182518
Avg memory utilized by Plain objects: 182.518
Min memory utilized: 182
Max memory utilized: 183
Total memory utilized by serialized objects: 182518
Avg memory utilized by serialized objects: 182.518
Min memory utilized: 182
Max memory utilized: 183
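One caveat on the memory numbers: sys.getsizeof() on a bytes object reports the size of the Python object (payload plus CPython's object header), not just the number of serialized bytes. A small sketch of the difference, using a made-up payload rather than real benchmark data:

```python
import sys

# Hypothetical serialized payload, not taken from the benchmark.
payload = b'{"status": true}'

wire_size = len(payload)              # bytes that would be sent or stored
object_size = sys.getsizeof(payload)  # payload plus the bytes-object overhead

print(f"len: {wire_size}, sys.getsizeof: {object_size}")
```

For comparing formats by encoded size, len(ser_data) is probably the more meaningful figure.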
As you can see, the total time for serialization and deserialization with JSON is far lower than with pure-protobuf. If I understand protobuf correctly, this should not be the case. Could it be because the proto schema is compiled every time we call obj.dumps()?
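One way to probe this might be to time the first call separately from later calls on the same object. The sketch below assumes nothing about pure-protobuf internals: first_vs_later and fake_dumps are hypothetical names, and fake_dumps only simulates a one-time setup cost so the example runs on its own. In the real benchmark, obj.dumps would be passed in instead.

```python
import time


def first_vs_later(func, arg, later_calls=100):
    """Time the first call separately from the average of later calls (seconds)."""
    start = time.perf_counter()
    func(arg)
    first = time.perf_counter() - start

    start = time.perf_counter()
    for _ in range(later_calls):
        func(arg)
    later_avg = (time.perf_counter() - start) / later_calls
    return first, later_avg


# Stand-in serializer with a simulated one-time setup cost (hypothetical).
_setup_done = False


def fake_dumps(obj):
    global _setup_done
    if not _setup_done:
        time.sleep(0.01)  # pretend to build a schema exactly once
        _setup_done = True
    return repr(obj).encode()


first, later_avg = first_vs_later(fake_dumps, {"status": True})
print(f"first call: {first * 1000:.3f} ms, later calls: {later_avg * 1000:.6f} ms")
```

If the first and later calls cost about the same with the real obj.dumps(), a one-time schema compile would not explain the gap, and the per-call overhead of a pure-Python encoder would be the more likely cause.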