feat(java/python): new xlang type system spec implementation (#1690)
## What does this PR do?

This PR implements a new [type
system](https://fury.apache.org/docs/specification/fury_xlang_serialization_spec/#type-systems)
for xlang serialization between Java and Python.

The changes include:
- Refine the type system spec by adding new types:
  - named_enum: an enum whose value will be serialized as the registered
    name.
  - struct: a morphic (final) type serialized by the Fury Struct serializer,
    i.e. it doesn't have subclasses. When deserializing `List<SomeClass>`,
    dynamic serializer dispatch can be skipped if `SomeClass` is
    morphic (final).
  - polymorphic_struct: a type which is not morphic (not final), i.e. it
    has subclasses. When deserializing `List<SomeClass>`, the serializer
    must be dispatched dynamically.
  - compatible_struct: a morphic (final) type serialized by the Fury
    compatible Struct serializer.
  - polymorphic_compatible_struct: a non-morphic (non-final) type
    serialized by the Fury compatible Struct serializer.
  - named_struct: a `struct` whose type mapping will be encoded as a name.
  - named_polymorphic_struct: a `polymorphic_struct` whose type mapping
    will be encoded as a name.
  - named_compatible_struct: a `compatible_struct` whose type mapping will
    be encoded as a name.
  - named_polymorphic_compatible_struct: a `polymorphic_compatible_struct`
    whose type mapping will be encoded as a name.
  - ext: a type which will be serialized by a customized serializer.
  - polymorphic_ext: an `ext` type which is not morphic (not final).
  - named_ext: an `ext` type whose type mapping will be encoded as a name.
  - named_polymorphic_ext: a `polymorphic_ext` type whose type mapping
    will be encoded as a name.
- Add a new XtypeResolver in Java to resolve xlang types.
- Support registering a class mapping by id. Before this PR, classes could
  only be registered by name, which costs more in both space and
  performance.
- Support passing type info to resolve type ambiguity such as
  `ArrayList/Object[]` in Java. Users can `serialize(List.of(1, 2, 3))`
  and deserialize it into an array with `deserialize(bytes, Integer[].class)`.
- Refactor pyfury serialization by moving the type resolver from Cython
  into Python code. This makes debugging easier, reduces code duplication,
  and also speeds up serialization.
- The golang xtype serialization tests are disabled; they will be re-enabled
  after the new type system is implemented in golang.
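
To illustrate why registering by id is cheaper in space than registering by name, here is a minimal sketch that compares the size of the type metadata in both cases. The encoding (a 7-bit variable-length id versus a length-prefixed UTF-8 namespace and type name) and the helper names `encode_varuint`, `type_meta_by_id`, and `type_meta_by_name` are assumptions made for illustration only; they are not Fury's actual wire format.

```python
def encode_varuint(value: int) -> bytes:
    """Encode a non-negative integer with 7 payload bits per byte."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def type_meta_by_id(type_id: int) -> bytes:
    # Hypothetical layout: the registered id is all the reader needs.
    return encode_varuint(type_id)


def type_meta_by_name(namespace: str, typename: str) -> bytes:
    # Hypothetical layout: both strings are written with a length prefix.
    ns = namespace.encode("utf-8")
    name = typename.encode("utf-8")
    return encode_varuint(len(ns)) + ns + encode_varuint(len(name)) + name


by_id = type_meta_by_id(300)
by_name = type_meta_by_name("example", "SomeClass")
print(len(by_id), len(by_name))
```

Under these assumptions the id form stays at a couple of bytes regardless of how long the class name is, while the name form grows with the namespace and type name.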

## Related issues

<!--
Is there any related issue? Please attach here.

- #xxxx0
- #xxxx1
- #xxxx2
-->


## Does this PR introduce any user-facing change?

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?


## Benchmark

<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.
-->
chaokunyang authored Dec 30, 2024
1 parent 8d2d124 commit 98efd72
Showing 82 changed files with 4,105 additions and 3,689 deletions.
1 change: 1 addition & 0 deletions BUILD
@@ -62,6 +62,7 @@ pyx_library(
),
deps = [
"//cpp/fury/util:fury_util",
"//cpp/fury/type:fury_type",
"@com_google_absl//absl/container:flat_hash_map",
],
)
2 changes: 1 addition & 1 deletion README.md
@@ -278,7 +278,7 @@ class SomeClass:
f3: Dict[str, str]

fury = pyfury.Fury(ref_tracking=True)
-fury.register_class(SomeClass, type_tag="example.SomeClass")
+fury.register_type(SomeClass, typename="example.SomeClass")
obj = SomeClass()
obj.f2 = {"k1": "v1", "k2": "v2"}
obj.f1, obj.f3 = obj, obj.f2
15 changes: 15 additions & 0 deletions cpp/fury/type/BUILD
@@ -0,0 +1,15 @@
load("@rules_cc//cc:defs.bzl", "cc_library", "cc_test")

cc_library(
name = "fury_type",
srcs = glob(["*.cc"], exclude=["*test.cc"]),
hdrs = glob(["*.h"]),
copts = ["-mavx2"], # Enable AVX2 support
linkopts = ["-mavx2"], # Ensure linker also knows about AVX2
strip_include_prefix = "/cpp",
alwayslink=True,
linkstatic=True,
deps = [
],
visibility = ["//visibility:public"],
)
153 changes: 153 additions & 0 deletions cpp/fury/type/type.h
@@ -0,0 +1,153 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

#include <cstdint> // For fixed-width integer types

namespace fury {
enum class TypeId : int32_t {
// a boolean value (true or false).
BOOL = 1,
// an 8-bit signed integer.
INT8 = 2,
// a 16-bit signed integer.
INT16 = 3,
// a 32-bit signed integer.
INT32 = 4,
// a 32-bit signed integer which uses fury var_int32 encoding.
VAR_INT32 = 5,
// a 64-bit signed integer.
INT64 = 6,
// a 64-bit signed integer which uses fury PVL encoding.
VAR_INT64 = 7,
// a 64-bit signed integer which uses fury SLI encoding.
SLI_INT64 = 8,
// a 16-bit floating point number.
FLOAT16 = 9,
// a 32-bit floating point number.
FLOAT32 = 10,
// a 64-bit floating point number including NaN and Infinity.
FLOAT64 = 11,
// a text string encoded using Latin1/UTF16/UTF-8 encoding.
STRING = 12,
// a data type consisting of a set of named values. Rust enums with
// non-predefined field values are not supported as an enum.
ENUM = 13,
// an enum whose value will be serialized as the registered name.
NAMED_ENUM = 14,
// a morphic(final) type serialized by Fury Struct serializer. i.e. it doesn't
// have subclasses. Suppose we're
// deserializing `List<SomeClass>`, we can save dynamic serializer dispatch
// since `SomeClass` is morphic(final).
STRUCT = 15,
// a type which is not morphic(not final). i.e. it has subclasses. Suppose
// we're deserializing
// `List<SomeClass>`, we must dispatch serializer dynamically since
// `SomeClass` is polymorphic(non-final).
POLYMORPHIC_STRUCT = 16,
// a morphic(final) type serialized by Fury compatible Struct serializer.
COMPATIBLE_STRUCT = 17,
// a non-morphic(non-final) type serialized by Fury compatible Struct
// serializer.
POLYMORPHIC_COMPATIBLE_STRUCT = 18,
// a `struct` whose type mapping will be encoded as a name.
NAMED_STRUCT = 19,
// a `polymorphic_struct` whose type mapping will be encoded as a name.
NAMED_POLYMORPHIC_STRUCT = 20,
// a `compatible_struct` whose type mapping will be encoded as a name.
NAMED_COMPATIBLE_STRUCT = 21,
// a `polymorphic_compatible_struct` whose type mapping will be encoded as a
// name.
NAMED_POLYMORPHIC_COMPATIBLE_STRUCT = 22,
// a type which will be serialized by a customized serializer.
EXT = 23,
// an `ext` type which is not morphic(not final).
POLYMORPHIC_EXT = 24,
// an `ext` type whose type mapping will be encoded as a name.
NAMED_EXT = 25,
// a `polymorphic_ext` type whose type mapping will be encoded as a name.
NAMED_POLYMORPHIC_EXT = 26,
// a sequence of objects.
LIST = 27,
// an unordered set of unique elements.
SET = 28,
// a map of key-value pairs. Mutable types such as
// `list/map/set/array/tensor/arrow` are not allowed as key of map.
MAP = 29,
// an absolute length of time, independent of any calendar/timezone, as a
// count of nanoseconds.
DURATION = 30,
// a point in time, independent of any calendar/timezone, as a count of
// nanoseconds. The count is relative
// to an epoch at UTC midnight on January 1, 1970.
TIMESTAMP = 31,
// a naive date without timezone. The count is days relative to an epoch at
// UTC midnight on Jan 1, 1970.
LOCAL_DATE = 32,
// exact decimal value represented as an integer value in two's complement.
DECIMAL = 33,
// a variable-length array of bytes.
BINARY = 34,
// a multidimensional array in which sub-arrays can have different sizes but
// all share the same element type.
// Only numeric components are allowed. Other arrays will be taken as List.
// The implementation should support the
// interoperability between array and list.
ARRAY = 35,
// one dimensional bool array.
BOOL_ARRAY = 36,
// one dimensional int8 array.
INT8_ARRAY = 37,
// one dimensional int16 array.
INT16_ARRAY = 38,
// one dimensional int32 array.
INT32_ARRAY = 39,
// one dimensional int64 array.
INT64_ARRAY = 40,
// one dimensional half_float_16 array.
FLOAT16_ARRAY = 41,
// one dimensional float32 array.
FLOAT32_ARRAY = 42,
// one dimensional float64 array.
FLOAT64_ARRAY = 43,
// an arrow [record
// batch](https://arrow.apache.org/docs/cpp/tables.html#record-batches)
// object.
ARROW_RECORD_BATCH = 44,
// an arrow [table](https://arrow.apache.org/docs/cpp/tables.html#tables)
// object.
ARROW_TABLE = 45,
BOUND = 64
};

inline bool IsNamespacedType(int32_t type_id) {
switch (static_cast<TypeId>(type_id)) {
case TypeId::NAMED_ENUM:
case TypeId::NAMED_STRUCT:
case TypeId::NAMED_POLYMORPHIC_STRUCT:
case TypeId::NAMED_COMPATIBLE_STRUCT:
case TypeId::NAMED_POLYMORPHIC_COMPATIBLE_STRUCT:
case TypeId::NAMED_EXT:
case TypeId::NAMED_POLYMORPHIC_EXT:
return true;
default:
return false;
}
}

} // namespace fury
24 changes: 24 additions & 0 deletions cpp/fury/util/buffer.h
@@ -27,6 +27,7 @@

#include "fury/util/bit_util.h"
#include "fury/util/logging.h"
#include "fury/util/status.h"

namespace fury {

@@ -133,6 +134,29 @@ class Buffer {

inline double GetDouble(uint32_t offset) { return Get<double>(offset); }

inline Status GetBytesAsInt64(uint32_t offset, uint32_t length,
int64_t *target) {
if (length == 0) {
*target = 0;
return Status::OK();
}
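// length must be <= 8 for the shift below to be well defined.
if (offset + 8 <= size_) {
// Fast path: 8 readable bytes, so load one word and mask off the rest.
uint64_t mask = 0xffffffffffffffff;
uint64_t x = (mask >> (8 - length) * 8);
*target = GetInt64(offset) & x;
} else {
// Compare without subtracting: unsigned subtraction can never be < 0.
if (offset + length > size_) {
return Status::OutOfBound("buffer out of bound");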
}
int64_t result = 0;
for (size_t i = 0; i < length; i++) {
result = result | ((int64_t)(data_[offset + i])) << (i * 8);
}
*target = result;
}
return Status::OK();
}

inline uint32_t PutVarUint32(uint32_t offset, int32_t value) {
if (value >> 7 == 0) {
data_[offset] = (int8_t)value;
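
The two code paths in `GetBytesAsInt64` (one masked 8-byte load versus a byte-by-byte loop near the end of the buffer) can be sketched in Python as follows. This is a hedged illustration, not part of the commit: `get_bytes_as_int64` is a hypothetical name, the sketch assumes little-endian byte order, and it rejects `length > 8` explicitly, which the C++ version leaves to the caller.

```python
def get_bytes_as_int64(data: bytes, offset: int, length: int) -> int:
    """Read `length` (0..8) little-endian bytes at `offset` as a signed int64."""
    if not 0 <= length <= 8:
        raise ValueError("length must be between 0 and 8")
    if length == 0:
        return 0
    if offset < 0 or offset + length > len(data):
        raise IndexError("buffer out of bound")
    if offset + 8 <= len(data):
        # Fast path: load a full 8-byte word, then mask off unwanted bytes.
        word = int.from_bytes(data[offset:offset + 8], "little")
        value = word & (0xFFFFFFFFFFFFFFFF >> ((8 - length) * 8))
    else:
        # Slow path near the end of the buffer: assemble byte by byte.
        value = 0
        for i in range(length):
            value |= data[offset + i] << (i * 8)
    # Reinterpret the unsigned result as a signed 64-bit integer.
    return value - (1 << 64) if value >= (1 << 63) else value


buf = (100).to_bytes(4, "little") + bytes(60)
print(get_bytes_as_int64(buf, 0, 1))
```

Both paths must agree for any in-bounds read; the fast path only exists to avoid the loop when a full word is available.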
10 changes: 10 additions & 0 deletions cpp/fury/util/buffer_test.cc
@@ -65,6 +65,16 @@ TEST(Buffer, TestVarUint) {
}
}

TEST(Buffer, TestGetBytesAsInt64) {
std::shared_ptr<Buffer> buffer;
AllocateBuffer(64, &buffer);
buffer->UnsafePut<int32_t>(0, 100);
int64_t result = -1;
EXPECT_TRUE(buffer->GetBytesAsInt64(0, 0, &result).ok());
EXPECT_EQ(result, 0);
EXPECT_TRUE(buffer->GetBytesAsInt64(0, 1, &result).ok());
EXPECT_EQ(result, 100);
}
} // namespace fury

int main(int argc, char **argv) {
2 changes: 1 addition & 1 deletion cpp/fury/util/logging.cc
@@ -111,4 +111,4 @@ bool FuryLog::IsLevelEnabled(FuryLogLevel log_level) {
return log_level >= fury_severity_threshold;
}

-} // namespace fury
+} // namespace fury
15 changes: 10 additions & 5 deletions cpp/fury/util/status.h
@@ -83,11 +83,12 @@ namespace fury {
enum class StatusCode : char {
OK = 0,
OutOfMemory = 1,
-KeyError = 2,
-TypeError = 3,
-Invalid = 4,
-IOError = 5,
-UnknownError = 6,
+OutOfBound = 2,
+KeyError = 3,
+TypeError = 4,
+Invalid = 5,
+IOError = 6,
+UnknownError = 7,
};

class Status {
Expand Down Expand Up @@ -123,6 +124,10 @@ class Status {
return Status(StatusCode::OutOfMemory, msg);
}

static Status OutOfBound(const std::string &msg) {
return Status(StatusCode::OutOfBound, msg);
}

static Status KeyError(const std::string &msg) {
return Status(StatusCode::KeyError, msg);
}
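
A caller can only tell an out-of-bound read apart from an allocation failure if each factory returns its own status code. The sketch below models that contract in Python, using the renumbered codes from the enum in this hunk. The `Status` dataclass and its method names are hypothetical stand-ins for the C++ class, written only to show the expected behaviour.

```python
from dataclasses import dataclass
from enum import IntEnum


class StatusCode(IntEnum):
    # Numeric values follow the enum after this commit.
    OK = 0
    OUT_OF_MEMORY = 1
    OUT_OF_BOUND = 2
    KEY_ERROR = 3
    TYPE_ERROR = 4
    INVALID = 5
    IO_ERROR = 6
    UNKNOWN_ERROR = 7


@dataclass(frozen=True)
class Status:
    code: StatusCode = StatusCode.OK
    message: str = ""

    def ok(self) -> bool:
        return self.code is StatusCode.OK

    @classmethod
    def out_of_memory(cls, message: str) -> "Status":
        return cls(StatusCode.OUT_OF_MEMORY, message)

    @classmethod
    def out_of_bound(cls, message: str) -> "Status":
        # Each factory must carry its own code so callers can branch on it.
        return cls(StatusCode.OUT_OF_BOUND, message)


status = Status.out_of_bound("buffer out of bound")
print(status.code.name, status.ok())
```

Because `OutOfBound` was inserted at value 2, any code that persisted or compared the old numeric values of `KeyError` through `UnknownError` would need to be updated as well.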
6 changes: 3 additions & 3 deletions docs/guide/xlang_serialization_guide.md
@@ -225,8 +225,8 @@ class SomeClass2:

if __name__ == "__main__":
f = pyfury.Fury()
-f.register_class(SomeClass1, type_tag="example.SomeClass1")
-f.register_class(SomeClass2, type_tag="example.SomeClass2")
+f.register_type(SomeClass1, typename="example.SomeClass1")
+f.register_type(SomeClass2, typename="example.SomeClass2")
obj1 = SomeClass1(f1=True, f2={-1: 2})
obj = SomeClass2(
f1=obj1,
@@ -444,7 +444,7 @@ class SomeClass:
f3: Dict[str, str]

fury = pyfury.Fury(ref_tracking=True)
-fury.register_class(SomeClass, type_tag="example.SomeClass")
+fury.register_type(SomeClass, typename="example.SomeClass")
obj = SomeClass()
obj.f2 = {"k1": "v1", "k2": "v2"}
obj.f1, obj.f3 = obj, obj.f2