Performance-related benchmarks for serialization and deserialization #88

@matrixbegins

This is more of a question.
I am benchmarking the JSON, Protobuf, and Avro data formats for an upcoming project.
I am using orjson for JSON serialization and deserialization, and pure-protobuf for Protobuf.
Here's what the code looks like:

Data classes:

from dataclasses import dataclass
from time import time  # needed by Head.fakeMe() for the timestamp

from faker import Faker
from pure_protobuf.dataclasses_ import field, message

Faker.seed(0)
fake = Faker()


@message
@dataclass
class Head():
    msgId: str = field(1)
    msgCode: str = field(2)
    guid: str = field(3)
    src: str = field(4)
    ts: int = field(5)

    @staticmethod
    def fakeMe():
        return Head(fake.md5(),
                fake.pystr(min_chars=5, max_chars=5),
                fake.ean(length=13),
                fake.pystr(min_chars=1, max_chars=1),
                int(time()*1000)
            )

@message
@dataclass
class Message():
    head: Head = field(1)
    # data: Data = field(2)
    status: bool = field(2)

    def fakeMe(self):
        self.head = Head.fakeMe()
        # self.data = Data.fakeMe()
        self.status = fake.pybool()  # assign the declared field, not a new "bool" attribute
        return self

Running Serialization and Deserialization:

import time, sys, orjson, message_pb2
from object_gen import create_dummy_obj
from dto.device_message import Message          # this is my data class

def measure_serialize_deserialize(obj, format):
  ser_fun = ser_obj.get(format)

  deser_fun = deser_obj.get(format)

  # serialize and measure time
  start_time = time.time()
  ser_data = ser_fun(obj)
  time_taken_ser = time.time() - start_time
  mem_ser = sys.getsizeof(ser_data)

  # deserialize and measure time
  start_time = time.time()
  deser_data = deser_fun(ser_data, Message)
  time_taken_deser = time.time() - start_time

  return (time_taken_ser, time_taken_deser, mem_ser)


def serialize_json(obj):
  return orjson.dumps(obj)



def deserialize_json(byteArr, klass):
  return orjson.loads(byteArr)


def serialize_proto(obj):
  return obj.dumps()


def deserialize_proto(byteArr, klass):
  return klass.loads(byteArr)


def serialize_avro(obj):
  pass

def deserialize_avro(byteArr, klass):
  pass


ser_obj = {
  "J": serialize_json,
  "P": serialize_proto,
  "A": serialize_avro,
}

deser_obj = {
  "J": deserialize_json,
  "P": deserialize_proto,
  "A": deserialize_avro,
}


def runBenchMarks(numberOfMsgs, format):

  ser_times = []
  deser_times = []
  memory_usage_plain = []
  memory_usage_ser = []

  for i in range(1, numberOfMsgs + 1):
    # create a new object based on the format.
    obj = create_dummy_obj(format)
    memory_usage_plain.append(sys.getsizeof(obj))
    ser_time, deser_time, mem_ser =  measure_serialize_deserialize(obj, format)
    ser_times.append(ser_time)
    deser_times.append(deser_time)
    memory_usage_ser.append(mem_ser)

  # return values
  return ser_times, deser_times, memory_usage_plain, memory_usage_ser
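
A note on the measurement itself: `time.time()` can have coarse resolution for calls this short (the JSON run reports a minimum of 0.0), and `sys.getsizeof` on a `bytes` object includes the Python object header, not just the payload length. The following is a minimal sketch of an alternative, assuming `time.perf_counter_ns()` for timing and `len()` for the payload size. The helper `time_call`, the `sample` dict, and the `serialize` stand-in (stdlib `json` instead of orjson or pure-protobuf) are hypothetical and only there to keep the example self-contained.

```python
import json
import sys
import time


def time_call(fn, arg, repeats=1000):
    """Return (best, mean) wall-clock time in milliseconds for fn(arg)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn(arg)
        samples.append((time.perf_counter_ns() - start) / 1e6)
    return min(samples), sum(samples) / len(samples)


def serialize(obj):
    # Stand-in serializer (assumption); swap in orjson.dumps or obj.dumps.
    return json.dumps(obj).encode("utf-8")


# Hypothetical payload shaped loosely like the Message/Head classes.
sample = {
    "head": {"msgId": "a" * 32, "msgCode": "ABCDE", "ts": 1700000000000},
    "status": True,
}

payload = serialize(sample)
best_ms, mean_ms = time_call(serialize, sample)

print(f"best={best_ms:.6f} ms  mean={mean_ms:.6f} ms")
print(f"payload bytes: {len(payload)}")             # bytes actually serialized
print(f"sys.getsizeof: {sys.getsizeof(payload)}")   # also counts the bytes object header
```

Reporting the best of many repeats alongside the mean makes one-off spikes (such as the large Max values in the results) easier to tell apart from the steady-state cost.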

After running the program for 1000 messages using pure-protobuf, I found:

Running benchmark for 1000 samples and format = P


 =========== Serialization METRICES (Time in ms) ====================
Total Time taken for serialization: 15.65241813659668
Avg Time taken for serialization: 0.01565241813659668
Min Time taken for serialization: 0.014781951904296875
Max Time taken for serialization: 0.04220008850097656


 =========== Deserialization METRICES (Time in ms) ====================
Total Time taken for deserialization: 21.908044815063477
Avg Time taken for deserialization: 0.021908044815063477
Min Time taken for deserialization: 0.0209808349609375
Max Time taken for deserialization: 0.051975250244140625


 =========== MEMORY METRICES (Bytes) ====================
Total memory utilized by Plain objects: 103000
Avg memory utilized by Plain objects: 103.0
Min memory utilized: 103
Max memory utilized: 103

Total memory utilized by serialized objects: 103000
Avg memory utilized by serialized objects: 103.0
Min memory utilized: 103
Max memory utilized: 103

Then I ran the same code for JSON:

Running benchmark for 1000 samples and format = J


 =========== Serialization METRICES (Time in ms) ====================
Total Time taken for serialization: 0.9558200836181641
Avg Time taken for serialization: 0.0009558200836181642
Min Time taken for serialization: 0.0
Max Time taken for serialization: 0.20194053649902344


 =========== Deserialization METRICES (Time in ms) ====================
Total Time taken for deserialization: 1.4314651489257812
Avg Time taken for deserialization: 0.0014314651489257812
Min Time taken for deserialization: 0.0007152557373046875
Max Time taken for deserialization: 0.029087066650390625


 =========== MEMORY METRICES (Bytes) ====================
Total memory utilized by Plain objects: 182518
Avg memory utilized by Plain objects: 182.518
Min memory utilized: 182
Max memory utilized: 183

Total memory utilized by serialized objects: 182518
Avg memory utilized by serialized objects: 182.518
Min memory utilized: 182
Max memory utilized: 183

As the numbers show, the total serialization and deserialization times for JSON are far lower than for pure-protobuf (about 0.96 ms vs. 15.65 ms for serialization, and 1.43 ms vs. 21.91 ms for deserialization). If I understand Protobuf correctly, this should not be the case. Could it be because the proto schema is compiled every time obj.dumps() is called?
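
One way to check whether per-call overhead dominates is to profile the serializer in a loop and see where the cumulative time goes. The sketch below uses the standard-library `cProfile` and `pstats` modules. The helper `profile_calls` and the `stand_in_dumps` function (stdlib `json`) are hypothetical placeholders; to test the actual hypothesis, pass a function that calls `obj.dumps()` on a pure-protobuf message instead.

```python
import cProfile
import io
import json
import pstats


def profile_calls(fn, arg, repeats=1000, top=10):
    """Run fn(arg) `repeats` times under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        fn(arg)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call paths come first.
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


def stand_in_dumps(obj):
    # Placeholder serializer (assumption); replace with: return obj.dumps()
    return json.dumps(obj)


report = profile_calls(stand_in_dumps, {"status": True, "ts": 1})
print(report)
```

If most of the cumulative time lands in schema or field-introspection functions rather than in the byte-encoding routines, that would support the idea that work is being repeated on every call.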
