A command-line application that generates measurement data for the MIND Foods Hub Data Lake.
Data is generated in two formats: CSV and ndjson.
MIND Foods Hub data is stored in a single table, named dl_measurements, that follows a denormalized data model to avoid expensive join operations.
This means that each row of the table can contain missing (NULL) values, depending on the type of measurement.
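As a rough illustration of how such a sparse row might look in the two output formats, here is a minimal sketch. It uses a small subset of the dl_measurements columns, and the helper names (csvField, toCsvLine, toNdjsonLine) are hypothetical. Writing NULL as an empty CSV field and as an explicit null in ndjson is an assumption, not necessarily what the generator does.

```javascript
// Subset of dl_measurements columns, chosen only for this sketch.
const COLUMNS = ["id", "double_value", "str_value", "unit_of_measure", "measure_timestamp"];

// Quote a CSV field only when it contains a comma, a double quote, or a line break.
function csvField(value) {
  // Assumption: NULL is written as an empty field.
  if (value === null || value === undefined) return "";
  const text = String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// One CSV line per row, columns in a fixed order.
function toCsvLine(row) {
  return COLUMNS.map((name) => csvField(row[name])).join(",");
}

// ndjson: one JSON object per line; NULL columns are kept as explicit nulls.
function toNdjsonLine(row) {
  const ordered = {};
  for (const name of COLUMNS) ordered[name] = row[name] ?? null;
  return JSON.stringify(ordered);
}

// A float-based measurement: str_value is NULL.
const row = {
  id: "m-1",
  double_value: 21.5,
  str_value: null,
  unit_of_measure: "C",
  measure_timestamp: "2021-01-01 00:00:00",
};

console.log(toCsvLine(row));    // m-1,21.5,,C,2021-01-01 00:00:00
console.log(toNdjsonLine(row));
```

The fixed column order keeps every CSV line aligned with the same header, even when most fields of a row are empty.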
This is the table schema:
CREATE TABLE dl_measurements
(
    id string,
    double_value double,
    str_value string,
    unit_of_measure string,
    sensor_id string,
    sensor_type string,
    sensor_desc_name string,
    location_id string,
    location_name string,
    location_description string,
    location_botanic_name string,
    location_cultivation_name string,
    location_latitude double,
    location_longitude double,
    location_altitude double,
    measure_timestamp timestamp,
    start_timestamp timestamp,
    end_timestamp timestamp,
    insertion_agent string,
    insertion_timestamp timestamp,
    CONSTRAINT dl_measurements_pk
        PRIMARY KEY (id) DISABLE NOVALIDATE
)
PARTITIONED BY (partion_date string)

MIND Foods Hub sensors are of three types:
- Measurement sensors, which register discrete floating-point values (for example temperature, humidity, or wind speed). This type of measurement is stored in the double_value column, while the time of the measurement is stored in the measure_timestamp column.
- Phase sensors, which register a range of floating-point values over a given period. This type of measurement is stored in the str_value column, while the start and end times of the measurement are stored in the start_timestamp and end_timestamp columns, respectively.
- Tag sensors, which register string-based values. This type of measurement is stored in the str_value column, while the time of the measurement is stored in the measure_timestamp column.
To randomly generate data for dl_measurements we need to mock this relation between a sensor type and its measurement, and guarantee these logical constraints:
- For float-based measurements, double_value is populated, while str_value is NULL. measure_timestamp is calculated, while start_timestamp and end_timestamp are NULL.
- For phase-based measurements, str_value is populated, while double_value is NULL. Both start_timestamp and end_timestamp are calculated, while measure_timestamp is NULL.
- For tag-based measurements, str_value is populated, while double_value is NULL. measure_timestamp is calculated, while start_timestamp and end_timestamp are NULL.
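The constraints above can be sketched as a small generator plus a checker. This is only an illustration under stated assumptions: the sensor type labels (MEASURE, PHASE, TAG), the value ranges, the placeholder strings, and the function names are hypothetical and may not match what index.js actually does.

```javascript
// Hypothetical sensor type labels; the real generator may use different names.
const SENSOR_TYPES = ["MEASURE", "PHASE", "TAG"];

// Pick a random instant in the half-open interval [from, to).
function randomDate(from, to) {
  const start = from.getTime();
  return new Date(start + Math.floor(Math.random() * (to.getTime() - start)));
}

// Build only the type-dependent columns; everything starts as NULL.
function generateValues(sensorType, from, to) {
  const row = {
    sensor_type: sensorType,
    double_value: null,
    str_value: null,
    measure_timestamp: null,
    start_timestamp: null,
    end_timestamp: null,
  };
  switch (sensorType) {
    case "MEASURE":
      row.double_value = Math.random() * 100; // arbitrary range for the sketch
      row.measure_timestamp = randomDate(from, to);
      break;
    case "PHASE": {
      row.str_value = "phase-" + Math.floor(Math.random() * 10); // placeholder value
      const a = randomDate(from, to);
      const b = randomDate(from, to);
      row.start_timestamp = a <= b ? a : b; // keep start <= end
      row.end_timestamp = a <= b ? b : a;
      break;
    }
    case "TAG":
      row.str_value = "tag-" + Math.floor(Math.random() * 10); // placeholder value
      row.measure_timestamp = randomDate(from, to);
      break;
    default:
      throw new Error(`Unknown sensor type: ${sensorType}`);
  }
  return row;
}

// Check the three logical constraints listed above.
function satisfiesConstraints(row) {
  const isMeasure = row.sensor_type === "MEASURE";
  const isPhase = row.sensor_type === "PHASE";
  const valueOk = isMeasure
    ? row.double_value !== null && row.str_value === null
    : row.str_value !== null && row.double_value === null;
  const timeOk = isPhase
    ? row.start_timestamp !== null && row.end_timestamp !== null && row.measure_timestamp === null
    : row.measure_timestamp !== null && row.start_timestamp === null && row.end_timestamp === null;
  return valueOk && timeOk;
}

const from = new Date("2021-01-01T00:00:00Z");
const to = new Date("2021-12-31T00:00:00Z");
for (const type of SENSOR_TYPES) {
  console.log(type, satisfiesConstraints(generateValues(type, from, to)));
}
```

Starting every row with all type-dependent columns set to NULL and filling in only what the sensor type requires makes it hard to violate a constraint by accident.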
First, install "MFH Measurements data generator" dependencies:
$ npm i

Then run the application with the following command:
$ node index.js

By default, "MFH Measurements data generator" generates 5 million rows for both the CSV and ndjson files.
To configure the number of rows to generate, use the NUMBER_OF_ROWS env variable:
$ NUMBER_OF_ROWS=100 node index.js

Other configuration env variables can be found in the config.js file.
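For readers curious how such an env variable could be read with a fallback to the 5 million default, here is a minimal sketch. The function name readNumberOfRows and the validation rules are assumptions for illustration; the actual config.js may be organized differently.

```javascript
// Default taken from the text above: 5 million rows.
const DEFAULT_NUMBER_OF_ROWS = 5000000;

// Read NUMBER_OF_ROWS from an env-like object, falling back to the default.
function readNumberOfRows(env = process.env) {
  const raw = env.NUMBER_OF_ROWS;
  if (raw === undefined || raw === "") return DEFAULT_NUMBER_OF_ROWS;
  const parsed = Number(raw);
  // Assumption: only positive integers are accepted.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`NUMBER_OF_ROWS must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

console.log(readNumberOfRows({ NUMBER_OF_ROWS: "100" })); // 100
console.log(readNumberOfRows({}));                        // 5000000
```

Failing fast on an invalid value avoids silently generating the full default dataset when the variable is mistyped.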