
Commit

Merge pull request #141 from haskell-works/decode-optimisation
3x faster, support logical types better, simpler interface
AlexeyRaga authored Mar 25, 2020
2 parents 8e40518 + 1918f10 commit 12d6360
Showing 110 changed files with 4,171 additions and 4,512 deletions.
4 changes: 2 additions & 2 deletions .vscode/tasks.json
@@ -5,7 +5,7 @@
"label": "Build",
"type": "shell",
"command": "bash",
"args": ["-lc", "cabal new-build && echo 'Done'"],
"args": ["-lc", "cabal v2-build --enable-tests --enable-benchmarks && echo 'Done'"],
"group": {
"kind": "build",
"isDefault": true
@@ -37,7 +37,7 @@
"label": "Test",
"type": "shell",
"command": "bash",
"args": ["-lc", "cabal new-test --enable-tests --enable-benchmarks --test-show-details=direct && echo 'Done'"],
"args": ["-lc", "cabal v2-test --enable-tests --enable-benchmarks --test-show-details=direct && echo 'Done'"],
"group": {
"kind": "test",
"isDefault": true
289 changes: 181 additions & 108 deletions README.md
@@ -8,143 +8,216 @@ and encoding Avro data structures. Avro can be thought of as a serialization
format and RPC specification which induces three separable tasks:

* *Serialization*/*Deserialization* - This library has been used "in anger" for:
    * Deserialization of Avro container files
    * Serialization/deserialization of Avro messages to/from Kafka topics
* *RPC* - There is currently no support for Avro RPC in this library.

## Generating code from Avro schema

The preferred way of using Avro is to be "schema first":
this library supports this by providing the ability to generate all the necessary entries (types, class instances, etc.) from Avro schemas.

```haskell
import Data.Avro
import Data.Avro.Deriving (deriveAvroFromByteString, r)

deriveAvroFromByteString [r|
{
  "name": "Person",
  "type": "record",
  "fields": [
    { "name": "fullName", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "gender",
      "type": { "type": "enum", "name": "Gender", "symbols": ["Male", "Female"] }
    },
    { "name": "ssn", "type": ["null", "string"] }
  ]
}
|]
```
This code will generate the following entries:

```haskell
data Gender = GenderMale | GenderFemale

schema'Gender :: Schema
schema'Gender = ...

data Person = Person
  { personFullName :: Text
  , personAge      :: Int32
  , personGender   :: Gender
  , personSsn      :: Maybe Text
  }

schema'Person :: Schema
schema'Person = ...
```
It will also generate all the useful instances for these types: `Eq`, `Show`, `Generic`, as well as `HasAvroSchema`, `FromAvro` and `ToAvro`.
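
For illustration, a value of the generated type could be constructed like this (a sketch that uses only the names shown above; `johnDoe` is a hypothetical example value and the `Text` literal assumes `OverloadedStrings`):

```haskell
-- Field names carry the record-name prefix added by the code generator.
johnDoe :: Person
johnDoe = Person
  { personFullName = "John Doe"
  , personAge      = 42
  , personGender   = GenderMale
  , personSsn      = Nothing
  }
```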

See the `Data.Avro.Deriving` module for more options, such as generating code from Avro schema files, controlling strictness, field prefixes, etc.

## Using Avro with existing Haskell types

**Note**: This is an advanced topic. Prefer generating types from schemas unless you need to make Avro work with manually defined Haskell types.

In this section we assume that the following Haskell type is manually defined:

```haskell
data Person = Person
  { fullName :: Text
  , age      :: Int32
  , ssn      :: Maybe Text
  } deriving (Eq, Show, Generic)
```
For a Haskell type to be encodable to Avro it should have a `ToAvro` instance, and to be decodable from Avro it should have a `FromAvro` instance.

There is also the `HasAvroSchema` class that is useful to have an instance of (although, strictly speaking, it is not required).

### Creating a schema

A schema can still be generated using TH:

```haskell
schema'Person :: Schema
schema'Person = $(makeSchemaFromByteString [r|
{
  "name": "Person",
  "type": "record",
  "fields": [
    { "name": "fullName", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "ssn", "type": ["null", "string"] }
  ]
}
|])
```

Alternatively, a schema can be defined manually:

```haskell
import Data.Avro
import Data.Avro.Schema.Schema (mkUnion)
import Data.List.NonEmpty (NonEmpty (..))

schema'Person :: Schema
schema'Person =
  Record "Person" [] Nothing Nothing
    [ fld "fullName" (String Nothing) Nothing
    , fld "age" (Int Nothing) Nothing
    , fld "ssn" (mkUnion $ Null :| [String Nothing]) Nothing
    ]
  where
    fld nm ty def = Field nm [] Nothing Nothing ty def
```
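
With a schema value in hand, the `HasAvroSchema` instance mentioned above can be a one-liner (a minimal sketch; `schema` here can be built with `pure`, check the class definition for the exact type):

```haskell
instance HasAvroSchema Person where
  schema = pure schema'Person
```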

---
**NOTE**: When a `Schema` is created separately from a data type, there is no way to guarantee that the schema actually matches the type; it is up to the developer to make sure of that.

Prefer generating data types with `Data.Avro.Deriving` when possible.

---

### Instantiating `FromAvro`

When working with `FromAvro` directly it is important to understand the difference between `Schema` and `ReadSchema`.

`Schema` (as in the example above) is just a regular data schema for an Avro type.

`ReadSchema` is a similar type, but it is capable of capturing and resolving differences between the "_writer_ schema" and the "_reader_ schema". See the [Specification](https://avro.apache.org/docs/current/spec.html#Schema+Resolution) to learn more about schema resolution and de-conflicting.

The `FromAvro` class requires a `ReadSchema` because Avro makes it possible to read data with a schema that differs from the schema the data was written with.

A `ReadSchema` can be obtained by converting an existing `Schema` with the `readSchemaFromSchema` function, or by actually deconflicting two schemas using the `deconflict` function.
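
Both routes could look roughly like this (a sketch only: the `Data.Avro` re-exports, the argument order of `deconflict`, and its error type are assumptions here — check the library documentation for the exact signatures; `writerSchema` is a placeholder):

```haskell
import Data.Avro (ReadSchema, Schema, deconflict, readSchemaFromSchema)

-- Placeholder for the schema the data was actually written with.
writerSchema :: Schema
writerSchema = undefined

-- Reading data with the very schema it was written with:
readerOnly :: ReadSchema
readerOnly = readSchemaFromSchema schema'Person

-- Reading data written with a different (e.g. older) schema:
resolved :: Either String ReadSchema
resolved = deconflict writerSchema schema'Person  -- writer schema first, reader schema second (assumed)
```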

Another **important fact** is that fields' values in an Avro payload are written and read _in the order_ in which the fields are defined in the schema.

This fact can be exploited when writing a `FromAvro` instance for `Person`:

```haskell
import           Data.Avro.Encoding.FromAvro (FromAvro (..))
import qualified Data.Avro.Encoding.FromAvro as FromAvro
import qualified Data.Vector                 as Vector

instance FromAvro Person where
  fromAvro (FromAvro.Record _schema vs) = Person
    <$> fromAvro (vs Vector.! 0)  -- fullName
    <*> fromAvro (vs Vector.! 1)  -- age
    <*> fromAvro (vs Vector.! 2)  -- ssn
```

Field resolution by name could be performed here (since we have a reference to the schema), but in this case it is simpler (and faster) to exploit the fact that the order of values is known and to access the required values by their positions.

### Instantiating `ToAvro`

The `ToAvro` class is defined as:

```haskell
class ToAvro a where
  toAvro :: Schema -> a -> Builder
```

A `Schema` is provided to help disambiguate how exactly the specified value should be encoded.

For example, `UTCTime` can be encoded as milliseconds or as microseconds depending on the schema's _logical type_, according to the [Specification](https://avro.apache.org/docs/current/spec.html#Logical+Types):

```haskell
instance ToAvro UTCTime where
  toAvro s = case s of
    Long (Just TimestampMicros) ->
      toAvro @Int64 s . fromIntegral . utcTimeToMicros

    Long (Just TimestampMillis) ->
      toAvro @Int64 s . fromIntegral . utcTimeToMillis
```

A `ToAvro` instance for the `Person` data type from above could look like this:

```haskell
import Data.Avro.Encoding.ToAvro (ToAvro (..), record, (.=))

instance ToAvro Person where
  toAvro schema value =
    record schema
      [ "fullName" .= fullName value
      , "age"      .= age value
      , "ssn"      .= ssn value
      ]
```

The `record` helper function is responsible for propagating the individual fields' schemas (found in the provided `schema`) when `toAvro`'ing nested values.
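
To turn the resulting `Builder` into actual bytes, something along these lines should work (a sketch: it assumes the `Builder` above is the standard bytestring `Builder` and that `schema'Person` from the earlier section is in scope):

```haskell
import qualified Data.ByteString.Builder as BB
import qualified Data.ByteString.Lazy    as BL

-- Serialize a single Person value against its schema (no container framing).
encodePerson :: Person -> BL.ByteString
encodePerson person = BB.toLazyByteString (toAvro schema'Person person)
```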

## Type mapping

The full list can be found in the `ToAvro` and `FromAvro` modules.

This library provides the following conversions between Haskell types and Avro types:

| Haskell type | Avro type |
|:------------------|:--------------------------------------------------------|
| () | "null" |
| Bool | "boolean" |
| Int, Int64 | "long" |
| Int32 | "int" |
| Double | "double" |
| Text | "string" |
| ByteString | "bytes" |
| Maybe a | ["null", "a"] |
| Either a b | ["a", "b"] |
| Identity a | ["a"] |
| Map Text a | { "type": "map", "values": "a" } |
| Map String a | { "type": "map", "values": "a" } |
| HashMap Text a | { "type": "map", "values": "a" } |
| HashMap String a | { "type": "map", "values": "a" } |
| [a] | { "type": "array", "items": "a" } |
| UTCTime | { "type": "long", "logicalType": "timestamp-millis" } |
| UTCTime | { "type": "long", "logicalType": "timestamp-micros" } |
| DiffTime | { "type": "int", "logicalType": "time-millis" } |
| DiffTime | { "type": "long", "logicalType": "time-micros" } |
| Day | { "type": "int", "logicalType": "date" } |
| UUID | { "type": "string", "logicalType": "uuid" } |

User-defined data types should provide `HasAvroSchema` / `ToAvro` / `FromAvro` instances to be encoded/decoded to/from Avro.
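
As a closing illustration, a value-level round trip could look roughly like this (a sketch under assumptions: the entry points are assumed to be named `encodeValueWithSchema` and `decodeValueWithSchema` in `Data.Avro` and to take the schema / read schema as their first argument — verify against the actual module before relying on this):

```haskell
import Data.Avro (decodeValueWithSchema, encodeValueWithSchema, readSchemaFromSchema)

-- Encode a Person with its schema, then decode it back with the matching ReadSchema.
roundTrip :: Person -> Either String Person
roundTrip person =
  decodeValueWithSchema (readSchemaFromSchema schema'Person)
                        (encodeValueWithSchema schema'Person person)
```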
2 changes: 1 addition & 1 deletion TODO
@@ -3,7 +3,7 @@
- Test round trip of example .avro containers
- Test round trip of each type.
- Data.Avro level To/From Avro classes
- Data.Avro.{Encode,Decode} level EncodeAvro/GetAvro classes
- Data.Avro.{Encode,Decode} level ToAvro/GetAvro classes
- Test 'deconflict' for all pathological deconflictions
* In-comment in-haddock tutorials and examples.
* Deal with 'order'?