A complete redesign of OCF's forecast data stack for performance and usability.
The Data Platform is a gRPC API server that provides efficient access to, and storage of, renewable energy forecast data. It has been architected to be performant under the specific workflows and data access patterns required by OCF's applications, in order to enable scaling and to improve the developer experience when integrating with OCF's stack. With this in mind, there is a focus not just on the quality of the code, but also on the tooling surrounding the codebase.
The benefits of the Data Platform over the current OCF stack (datamodels) include, but aren't limited to:
- Two orders of magnitude faster (milliseconds vs seconds)
- Performant at scale (tested to 50x current org scope)
- Cheaper deployment, as on-the-fly calculation removes the need for separate analysis microservices
- Fully typed client implementations in Python and TypeScript
- Simple to understand due to codegen of boilerplate
- Safer architecture with a single, considered point of entry to the database
- Unlocks greater depth of analysis with geometries, capacity limits, history and more
The Data Platform has clear separation boundaries between its components:
              +-----------------------------------------------------------------+
              |                      Data Platform Server                       |
              | +-------------------+                     +-------------------+ |
--- Clients --> |  External Schema  | <-- Server Impl --- |  Database Schema  | <-- Database
              | +-------------------+                     +-------------------+ |
              |                                                                 |
              +-----------------------------------------------------------------+
The Data Platform defines a strongly typed data contract as its external interface. This is the
API that any external clients have to use to interact with the platform. The schema for this is
defined via Protocol Buffers in proto/ocf/dp.
Boilerplate code for client and server implementations is generated in the required language from
these .proto files using the protoc compiler.
Note
This is a direct analogue to the Pydantic models used in the old datamodel project.
Changes to the schema modify the data contract, and will require client and server implementations to regenerate their bindings and update their code. As such, they should be made with purpose and care.
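For illustration, the sketch below shows what calling the Data Platform from Go with the generated bindings might look like. The import path, service name, and RPC names in the comments are assumptions, not the real identifiers; those come from the code generated from proto/ocf/dp.

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	// dppb "github.com/openclimatefix/data-platform/gen/go/ocf/dp" // hypothetical import path for the generated bindings
)

func main() {
	// Open a client connection to a locally running Data Platform server.
	conn, err := grpc.NewClient("localhost:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("creating gRPC client: %v", err)
	}
	defer conn.Close()

	// With the generated bindings in scope, RPCs become ordinary typed method calls, e.g.:
	//   client := dppb.NewDataPlatformClient(conn)                          // hypothetical service name
	//   resp, err := client.GetForecast(ctx, &dppb.GetForecastRequest{...}) // hypothetical RPC
	log.Println("client created for", conn.Target())
}
```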
The Data Platform can be configured to use different database backends. Each backend has a server implementation that inherits the External Schema. The currently supported backends are:
- PostgreSQL
- Dummy (a memoryless database for quick testing)
and are selected according to the relevant environment variables (see the Configuration section).
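As a rough, self-contained sketch of what this selection amounts to (the DP_BACKEND variable name and the types below are invented for illustration; consult cmd/server.conf for the real options):

```go
package main

import (
	"fmt"
	"os"
)

// Backend stands in for a server implementation behind the external schema.
type Backend interface{ Name() string }

type postgresBackend struct{}

func (postgresBackend) Name() string { return "postgres" }

type dummyBackend struct{}

func (dummyBackend) Name() string { return "dummy (memoryless, for quick testing)" }

// selectBackend picks an implementation from the environment. DP_BACKEND is a
// hypothetical variable name used only for this sketch.
func selectBackend() Backend {
	if os.Getenv("DP_BACKEND") == "postgres" {
		return postgresBackend{}
	}
	return dummyBackend{}
}

func main() {
	fmt.Println("selected backend:", selectBackend().Name())
}
```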
The schema for the PostgreSQL backend is defined using PostgreSQL's native SQL dialect in the
internal/server/postgres/sql/migrations directory, and access functions to the data are defined
in internal/server/postgres/sql/queries.
Boilerplate code for using these queries is generated using the sqlc tool. This generated code
provides a strongly typed interface to the database.
Note
This is a direct analogue to the SQLAlchemy models used in the old datamodel project.
Having the queries defined in SQL allows for more efficient interaction with the database, as they can be written to take advantage of the database's features and to make optimal use of its indexes.
These changes can be made without having to update the data contract, and so will not require updates to clients using the Data Platform.
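As a rough illustration, the sketch below shows how sqlc-generated code is typically consumed. It assumes a pgx connection pool (which database driver the Data Platform actually uses is an assumption), and the gen import path and the GetLatestForecast query are hypothetical; the real generated methods live under internal/server/postgres/gen.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5/pgxpool"
	// gen "github.com/openclimatefix/data-platform/internal/server/postgres/gen" // hypothetical import path for the sqlc output
)

func main() {
	ctx := context.Background()

	// DATABASE_URL is a placeholder; see the Configuration section for the real variables.
	pool, err := pgxpool.New(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatalf("connecting to postgres: %v", err)
	}
	defer pool.Close()

	// With the generated package in scope, every named SQL query becomes a typed method:
	//   queries := gen.New(pool)                          // sqlc's conventional constructor
	//   latest, err := queries.GetLatestForecast(ctx, 42) // hypothetical query method
	log.Println("connection pool created")
}
```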
Note
If using PostgreSQL as a backend, it is recommended that you tune your database instance according to its specifications (available CPU, RAM, etc.). This will ensure optimal performance for the Data Platform server.
The Database Schema is mapped to the External Schema by implementing the server interface generated
from the Data Contract. This is done in internal/server/<database>/serverimpl.go. It isn't much
more than a conversion layer, with the business logic shared between the implemented functions and
the SQL queries.
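The sketch below illustrates the shape of such a conversion method. Every type in it is a stand-in invented for illustration; in the real code the request/response types come from the generated proto bindings and the query methods from the sqlc-generated package.

```go
package main

import (
	"context"
	"fmt"
)

// Stand-ins for generated proto messages (hypothetical shapes).
type GetForecastRequest struct{ LocationId int64 }
type ForecastValue struct{ PowerKw float64 }
type GetForecastResponse struct{ Values []*ForecastValue }

// Stand-ins for the sqlc-generated queries type and row type (hypothetical).
type forecastRow struct{ PowerKw float64 }
type Queries struct{}

func (Queries) ListForecastValues(ctx context.Context, locationID int64) ([]forecastRow, error) {
	return []forecastRow{{PowerKw: 1.5}, {PowerKw: 2.0}}, nil
}

// Server mirrors the role of the struct in internal/server/<database>/serverimpl.go.
type Server struct{ queries Queries }

func (s *Server) GetForecast(ctx context.Context, req *GetForecastRequest) (*GetForecastResponse, error) {
	// 1. Call the strongly typed, generated query.
	rows, err := s.queries.ListForecastValues(ctx, req.LocationId)
	if err != nil {
		return nil, fmt.Errorf("querying forecast values: %w", err)
	}

	// 2. Convert database rows into external-schema messages: a thin conversion layer.
	resp := &GetForecastResponse{}
	for _, r := range rows {
		resp.Values = append(resp.Values, &ForecastValue{PowerKw: r.PowerKw})
	}
	return resp, nil
}

func main() {
	resp, _ := (&Server{}).GetForecast(context.Background(), &GetForecastRequest{LocationId: 1})
	fmt.Println("values returned:", len(resp.Values))
}
```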
The Data Platform gRPC API server is packaged for portability as a container. This can be run using a container orchestration tool, e.g. with Docker:
$ docker run -p 50051:50051 ghcr.io/openclimatefix/data-platform
Alternatively, it can be run locally using Go. See Local Running in the Development section.
Once running, the server RPCs can be investigated using a gRPC client tool.
To connect to a backend database, and so retain the platform's data, the server must be
appropriately configured via environment variables. All available options are defined via the
configuration file in cmd/server.conf.
Important
Whilst the configuration is held in a file, this is NOT intended to be overwritten or modified in order to configure the Data Platform. Configuration should always be handled via environment variables; the config file is simply provided as a version-controlled single point of reference for what those variables might be.
The available configuration may differ between versions of the Data Platform. Ensure you check the correct version of the configuration file for your deployment.
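For illustration only, the sketch below shows the environment-first pattern the server follows when reading its configuration. The variable names used here are hypothetical; the authoritative list for your version is cmd/server.conf.

```go
package main

import (
	"log"
	"os"
)

// getenvOr returns the value of key, falling back to def when it is unset.
func getenvOr(key, def string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return def
}

func main() {
	logLevel := getenvOr("DP_LOG_LEVEL", "info") // hypothetical variable name
	dsn := os.Getenv("DP_POSTGRES_DSN")          // hypothetical variable name
	log.Printf("log level %q; database configured: %t", logLevel, dsn != "")
}
```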
There is an example Python notebook, written with Marimo, demonstrating how to use the Python
bindings as a client to a Data Platform server. The example runs through a data analysis workflow.
To run it, first ensure that the Data Platform Server is running on localhost:50051 (see Getting
Started), and that the Python bindings have been generated (see Generating Code). Then use uvx to
run the notebook:
$ make gen.proto.python
$ uvx marimo edit --headless --sandbox examples/python-notebook/example.py
For ease, the above process is wrapped in a Makefile target:
$ make run.notebook

This project requires the Go Toolchain to be installed.
Note
This project uses Go modules for dependency management. Ensure that your PATH environment
variable has been updated to include the Go binary installation location, as per the instructions
linked above, otherwise you may see errors.
Clone the repository, then run
$ make init
This will fetch the dependencies, and install the git hooks required for development.
Important
Since this project uses a lot of generated code, these hooks are vital to keep this generated
code up to date, and as such running make init is a necessary step towards a smooth development
experience.
The server can be run locally, without a database connection, using a fake database implementation, via a Make target. This is recommended as it will ensure that code generation is up to date and that the running version has been embedded into the built binary.
$ make run
This will start the Data Platform gRPC API server on localhost:50051. The RPCs can then be
investigated using a tool such as grpcurl or
grpcui. In this testing mode, the data returned by the server is
entirely generated and has little bearing on the request objects themselves.
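To illustrate why, the sketch below mimics (but is not) the dummy backend: values are synthesised per call and nothing is persisted, so responses bear little relation to what was asked for.

```go
package main

import (
	"fmt"
	"math/rand"
)

type forecastValue struct{ PowerKw float64 }

// generateForecast returns n freshly synthesised values; nothing is read from,
// or written to, any store, so repeated calls never agree with each other.
func generateForecast(n int) []forecastValue {
	values := make([]forecastValue, n)
	for i := range values {
		values[i] = forecastValue{PowerKw: rand.Float64() * 1000} // random, never persisted
	}
	return values
}

func main() {
	fmt.Println("generated", len(generateForecast(48)), "forecast values") // fresh data on every call
}
```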
There is also an example Docker compose file in examples/docker-compose.yml, which runs the Data
Platform API server in a container, backed by Postgres, and which also includes a gRPC UI for
testing.
Unit tests can be run using make test. Benchmarks can be run using make bench.
Both of these utilise TestContainers,
so ensure you meet their
general system requirements.
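For reference, the sketch below shows the general TestContainers pattern such tests rely on: a throwaway Postgres container is started for the duration of a test and torn down afterwards. It uses the generic testcontainers-go API and mirrors, rather than reproduces, the project's actual test helpers.

```go
package postgres_test

import (
	"context"
	"testing"

	"github.com/testcontainers/testcontainers-go"
	"github.com/testcontainers/testcontainers-go/wait"
)

func TestWithThrowawayPostgres(t *testing.T) {
	ctx := context.Background()

	// Start a disposable Postgres container for the duration of this test.
	container, err := testcontainers.GenericContainer(ctx, testcontainers.GenericContainerRequest{
		ContainerRequest: testcontainers.ContainerRequest{
			Image:        "postgres:16-alpine",
			Env:          map[string]string{"POSTGRES_PASSWORD": "test", "POSTGRES_DB": "test"},
			ExposedPorts: []string{"5432/tcp"},
			WaitingFor:   wait.ForListeningPort("5432/tcp"),
		},
		Started: true,
	})
	if err != nil {
		t.Fatalf("starting postgres container: %v", err)
	}
	t.Cleanup(func() { _ = container.Terminate(ctx) })

	host, err := container.Host(ctx)
	if err != nil {
		t.Fatalf("getting container host: %v", err)
	}
	port, err := container.MappedPort(ctx, "5432")
	if err != nil {
		t.Fatalf("getting mapped port: %v", err)
	}
	t.Logf("throwaway postgres available at %s:%s", host, port.Port())

	// ...run migrations and exercise the server implementation against this instance...
}
```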
In order to make changes to the SQL queries, or add a new Database migration, you will need to
add or modify the relevant .sql files in the sql directory. Then, regenerate the Go library
code to reflect these changes. This can be done using
$ make gen
This will populate the internal/server/postgres/gen directory with strongly typed Go bindings for
the queries. Next, update the serverimpl.go file for the given
database to use the newly generated code, and ensure the test suite passes. Since the Data Platform
container automatically migrates the database on startup, simply re-deploying the container will
propagate the changes to your deployment environment.
In order to change the Data Contract, you will need to modify the .proto files in the proto
directory, and regenerate the code. gRPC client/server interfaces and boilerplate code are
generated from these Protocol Buffer definitions. The make gen target already handles generating
the Go code used internally in the application, placing generated code in internal/gen.
Language-specific client/server bindings for external applications are generated as part of the CI
pipeline, but can also be generated manually, e.g. for Python:
$ make gen.proto.python
This places the generated code in gen/python. See the Makefile for more external targets.
Complexity analysis of Data Platform vs old datamodels & metrics (scc)
Data Platform:
────────────────────────────────────────────────────────────────────────────
Language             Files     Lines   Blanks  Comments     Code  Complexity
────────────────────────────────────────────────────────────────────────────
SQL                      7      1288       75       566      647           6
Go                       5      2353      222       191     1940         253
Shell                    4       108       11        17       80          11
YAML                     4       224       34         2      188           0
Protocol Buffers         2       418       82        79      257           0
Makefile                 1        88       20         3       65           7
Markdown                 1       143       33         0      110           0
────────────────────────────────────────────────────────────────────────────
Total                   24      4622      477       858     3287         277
────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $94,239
Estimated Schedule Effort (organic) 5.61 months
Estimated People Required (organic) 1.49
────────────────────────────────────────────────────────────────────────────
Processed 11480397 bytes, 11.480 megabytes (SI)
────────────────────────────────────────────────────────────────────────────
Datamodels & Metrics:
────────────────────────────────────────────────────────────────────────────
Language             Files     Lines   Blanks  Comments     Code  Complexity
────────────────────────────────────────────────────────────────────────────
Python                 190     23776     3213      3119    17444         508
YAML                     9       294       30        10      254           0
Markdown                 6       825      222         0      603           0
CSV                      3       978        0         0      978           0
Mako                     3        74       21         0       53           0
TOML                     3       196       28        20      148           2
Dockerfile               2        56       18        12       26           2
INI                      2       213       46        99       68           0
Plain Text               2        12        0         0       12           0
Autoconf                 1         1        0         0        1           0
License                  1        21        4         0       17           0
Makefile                 1        23        4         5       14           0
────────────────────────────────────────────────────────────────────────────
Total                  223     26469     3586      3265    19618         512
────────────────────────────────────────────────────────────────────────────
Estimated Cost to Develop (organic) $615,010
Estimated Schedule Effort (organic) 11.43 months
Estimated People Required (organic) 4.78
────────────────────────────────────────────────────────────────────────────
Processed 909596 bytes, 0.910 megabytes (SI)
────────────────────────────────────────────────────────────────────────────
(Produced via $ scc --exclude-dir=".git,examples,proto/buf,proto/google". Data may be out of date.)