Pravega OLAP Integration
Pravega is an open-source project maintained by Dell Technologies. It is an infrastructure that serves as a storage system implementing data streams to store and serve data. These streams are made up of segments, which contain events: sets of bytes in a stream that represent some piece of data. Pravega is effective at ingesting and storing these events because its streams are consistent, durable, elastic, and append-only.
Pravega stores data in a row-oriented manner, which allows all data points relating to one object to be stored in the same data block. This is beneficial for queries that need to read or manipulate an entire object, but it is slow for analyzing large amounts of data. When we want to process events via big data analytics queries, efficiency is poor due to Pravega's row-oriented structure. A column-oriented processing engine, in which a block stores the same data point across many distinct objects, would allow quicker analysis of individual data points as well as compression of columns. Without ingesting Pravega events into a proper big data analytics engine, queries against the events are very slow.
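The row-versus-column trade-off can be illustrated with a small, self-contained Java sketch (no Pravega or Druid dependencies; the `Event` fields and class names are ours for illustration). Summing a single field over row-oriented events touches every whole row, while the columnar layout scans one compact array:

```java
import java.util.List;

public class LayoutDemo {
    // Row-oriented: each event keeps all of its fields together, as Pravega does.
    public record Event(String sensor, long timestamp, double value) {}

    // Row-oriented analytics: every whole row must be visited to read one field.
    public static double sumValuesRowOriented(List<Event> events) {
        double sum = 0;
        for (Event e : events) sum += e.value();
        return sum;
    }

    // Column-oriented analytics: the "value" column alone is scanned,
    // which is cache-friendly and compresses well (similar values together).
    public static double sumValuesColumnar(double[] valueColumn) {
        double sum = 0;
        for (double v : valueColumn) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<Event> rows = List.of(
                new Event("s1", 1L, 2.0),
                new Event("s2", 2L, 3.0),
                new Event("s3", 3L, 5.0));
        double[] valueColumn = {2.0, 3.0, 5.0}; // same data, one column

        System.out.println(sumValuesRowOriented(rows));    // prints 10.0
        System.out.println(sumValuesColumnar(valueColumn)); // prints 10.0
    }
}
```

Both layouts yield the same answer; the difference is how much unrelated data each scan has to touch, which is what makes a columnar engine like Druid faster for analytics.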
We use Java to create a plugin that enables the automatic ingestion of Pravega's data streams into Apache Druid for processing.
TODO:
Pravega's data streams need to be read and transformed from their row-oriented layout so they can be written into Apache Druid in a column-oriented manner.
Adapt Druid's Kafka indexing service to operate with the Pravega API, thus creating a Pravega indexing service for Druid. The goal is to examine the Kafka, Pulsar, and Kinesis indexing-service setups, then consult the Pravega Java documentation to gain insight into what the equivalent Pravega implementation will be and how it can be transformed into a Pravega indexing service that facilitates automatic ingestion of streams from Pravega into Druid.
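As a starting point, the following is a minimal sketch of reading events from a Pravega stream with the Pravega Java client API, which is roughly what an indexing-service ingestion task would do. The scope `examples`, stream `events`, and reader group `druid-rg` are our own assumed names, and the sketch assumes a running standalone Pravega controller at `tcp://localhost:9090`:

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.ReaderGroupManager;
import io.pravega.client.stream.EventRead;
import io.pravega.client.stream.EventStreamReader;
import io.pravega.client.stream.ReaderConfig;
import io.pravega.client.stream.ReaderGroupConfig;
import io.pravega.client.stream.Stream;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class ReadFromPravega {
    public static void main(String[] args) {
        // Assumed local standalone controller endpoint.
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();

        // A reader group tracks position in the stream,
        // comparable to a consumer group in Kafka.
        try (ReaderGroupManager rgm = ReaderGroupManager.withScope("examples", config)) {
            rgm.createReaderGroup("druid-rg", ReaderGroupConfig.builder()
                    .stream(Stream.of("examples", "events")).build());
        }

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", config);
             EventStreamReader<String> reader = factory.createReader(
                     "reader-1", "druid-rg",
                     new UTF8StringSerializer(), ReaderConfig.builder().build())) {
            EventRead<String> event;
            // readNextEvent blocks up to the timeout (ms);
            // a null payload here means nothing arrived in time.
            while ((event = reader.readNextEvent(2000)).getEvent() != null) {
                System.out.println("read: " + event.getEvent());
            }
        }
    }
}
```

In the actual plugin, the events read here would be handed to Druid's ingestion pipeline rather than printed.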
Pravega
OLAP Database (Apache Druid)
Java
SQL
Guice
IntelliJ
CI/CD
JUnit
Maven
AWS EC2
AWS S3 Storage
Zookeeper
Pravega provides a new storage abstraction: a stream for continuously generated and unbounded data. In comparison to distributed messaging systems such as Kafka and Pulsar, Pravega provides a multitude of features that are useful for modern data-intensive applications. While Kafka and Pulsar support transactions, long-term retention, and event streams, they lack features such as durability by default, auto-scaling, and ingestion of large data. Pravega offers all of these essential features. However, Pravega is not an analytics engine, so it cannot process the data it ingests. Our plugin will integrate Pravega with Druid and enable the automatic ingestion of data streams into an OLAP database, such that a user can perform log-based analytics against the events in their streams.
We will develop the ingestion plugin mostly using the client APIs from Pravega and Apache Druid. Note that there are distributed modes for both Pravega and Apache Druid, but we will develop the plugin in our local environment and test it with the standalone modes of Pravega and Apache Druid. Please see the prerequisites and installation steps below for details.
-
Pravega Distributed Mode Prerequisites
- HDFS
- Setup an HDFS storage cluster running HDFS version 2.7+. HDFS is used as Tier 2 storage and must have sufficient capacity to store contents of all streams. The storage cluster is recommended to be run alongside Pravega on separate nodes.
- Java
- Install the latest Java 8 from java.oracle.com. Packages are available for all major operating systems.
- Zookeeper
- Pravega requires Zookeeper 3.5.1-alpha+. At least 3 Zookeeper nodes are recommended for a quorum. No special configuration is required for Zookeeper but it is recommended to use a dedicated cluster for Pravega.
- This specific version of Zookeeper can be downloaded from Apache at zookeeper-3.5.1-alpha.tar.gz.
- For installing Zookeeper see the Getting Started Guide.
- Bookkeeper
- Pravega requires Bookkeeper 4.4.0+. At least 3 Bookkeeper servers are recommended for a quorum.
- This specific version of Bookkeeper can be downloaded from Apache at bookkeeper-server-4.4.0-bin.tar.gz.
- For installing Bookkeeper see the Getting Started Guide. Some specific Pravega instructions are shown below. All steps assume they are run from the bookkeeper-server-4.4.0 directory.
**Bookkeeper Configuration**
The following configuration options should be changed in the conf/bk_server.conf file:
# Comma separated list of <zk-ip>:<port> for all ZK servers
zkServers=localhost:2181
# Alternatively specify a different path to the storage for /bk
journalDirectory=/bk/journal
ledgerDirectories=/bk/ledgers
indexDirectories=/bk/index
zkLedgersRootPath=/pravega/bookkeeper/ledgers
- Initializing Zookeeper paths
- The following paths need to be created in Zookeeper. From the zookeeper-3.5.1-alpha directory on the Zookeeper servers run:
bin/zkCli.sh -server $ZK_URL create /pravega
bin/zkCli.sh -server $ZK_URL create /pravega/bookkeeper
- Running Bookkeeper
bin/bookkeeper shell metaformat -nonInteractive
- To start the bookie:
bin/bookkeeper bookie
-
Pravega Standalone Mode Prerequisites (Testing & Demo Purposes)
- Java 8 or later for client-only applications
- Java 11 or later for standalone demo and server-side applications
-
Apache Druid Standalone Mode Prerequisites
- Linux, Mac OS X, or other Unix-like OS. (Windows is not supported.)
- Java 8u92+ or Java 11.
No Add-ons yet that we know of.
-
Installation of Pravega
- Download from here: https://github.com/pravega/pravega/releases
tar -xzf <name>.tgz
cd bin
./pravega-standalone
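To sanity-check the standalone deployment, a few events can be written with the Pravega Java client. This is a sketch under assumptions: the default standalone controller endpoint `tcp://localhost:9090`, and scope/stream names (`examples`/`events`) that we chose for illustration:

```java
import io.pravega.client.ClientConfig;
import io.pravega.client.EventStreamClientFactory;
import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.EventStreamWriter;
import io.pravega.client.stream.EventWriterConfig;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;
import io.pravega.client.stream.impl.UTF8StringSerializer;
import java.net.URI;

public class WriteToPravega {
    public static void main(String[] args) {
        ClientConfig config = ClientConfig.builder()
                .controllerURI(URI.create("tcp://localhost:9090")).build();

        // Create the scope and a single-segment stream if they do not exist yet.
        try (StreamManager streams = StreamManager.create(config)) {
            streams.createScope("examples");
            streams.createStream("examples", "events", StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.fixed(1)).build());
        }

        try (EventStreamClientFactory factory =
                     EventStreamClientFactory.withScope("examples", config);
             EventStreamWriter<String> writer = factory.createEventWriter(
                     "events", new UTF8StringSerializer(),
                     EventWriterConfig.builder().build())) {
            // Events with the same routing key are ordered relative to each other.
            writer.writeEvent("device-1",
                    "{\"timestamp\":\"2022-10-01T00:00:00Z\",\"value\":42}").join();
        }
    }
}
```

Writing JSON payloads like this gives the Druid side something realistic to parse during ingestion testing.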
-
Installation of Apache Druid
- Download from here: https://www.apache.org/dyn/closer.cgi?path=/druid/24.0.0/apache-druid-24.0.0-bin.tar.gz
tar -xzf apache-druid-24.0.0-bin.tar.gz
cd apache-druid-24.0.0
After successfully installing Pravega and Druid, users should start the Druid services using the micro-quickstart single-machine configuration.
Once the Druid services finish starting up, users should launch the Druid web console at http://localhost:8888.
After Druid's web console has launched successfully, users should navigate to the Load data page and select Pravega. In the Pravega plugin, users can enter the desired specification and submit it.
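By analogy with the Kafka indexing service's supervisor spec, the specification submitted through the console could look roughly like the following. This is a hypothetical sketch: the `"pravega"` type and the connection fields (`controllerUri`, `scope`, `stream`) depend on the final plugin design, and the scope/stream names are our own examples:

```json
{
  "type": "pravega",
  "spec": {
    "ioConfig": {
      "type": "pravega",
      "controllerUri": "tcp://localhost:9090",
      "scope": "examples",
      "stream": "events",
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "pravega-events",
      "timestampSpec": { "column": "timestamp", "format": "iso" },
      "dimensionsSpec": { "dimensions": [] }
    },
    "tuningConfig": { "type": "pravega" }
  }
}
```

The `dataSchema` and `tuningConfig` portions follow standard Druid streaming-ingestion conventions; only the `ioConfig` would be Pravega-specific.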
Currently there are no known problems.
-
Non-Collaborator Contribution
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D
-
Collaborator Contribution
- Pull the latest changes on the master branch:
git pull
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request
Pravega readings:
https://cncf.pravega.io/docs/nightly/pravega-concepts/#introduction
https://cncf.pravega.io/docs/v0.11.0/
https://cncf.pravega.io/docs/latest/javadoc/clients/index.html
Apache Druid readings:
https://druid.apache.org/druid
https://druid.apache.org/docs/latest/design/index.html
https://druid.apache.org/use-cases
License URL: https://github.com/WSUCptSCapstone-Fall2022Spring2023/dell-pravegaolapjava/blob/master/Document-CptS421/License.txt