## About this guide

The instructions below offer an example of how to execute Soda Checks Language (SodaCL) checks for data quality within a Databricks pipeline that handles data which trains a machine learning (ML) model.

For context, this guide demonstrates a Data Scientist and Data Engineer working with Human Resources data to build a forecast model for employee attrition. The Data Engineer, working with the Data Scientist, uses a Databricks notebook to gather data from a SQL-accessible dataset, transforms the data into the correct format for their ML model, then uses the data to train the model.

Though they do not have direct access to the data to resolve issues themselves, the Data Engineer can use Soda to detect data quality issues before the model trains on poor-quality data. The pipeline the Data Engineer creates embeds various SodaCL checks at two stages: after data ingestion and after data transformation. At the end of the process, the pipeline stores the checks' metadata in a Databricks table which feeds into a data quality dashboard. The Data Engineer uses Databricks Workflows to schedule this process on a daily basis.

## Prerequisites

The Data Engineer in this example uses the following:
* Python 3.8, 3.9, or 3.10
* Pip 21.0 or greater
* a Databricks account

To validate an account license or free trial, Soda Library must communicate with a Soda Cloud account via API keys. You create a set of API keys in your Soda Cloud account, then use them to configure the connection to Soda Library.

1. In a browser, the Data Engineer navigates to <a href="https://cloud.soda.io/signup" target="_blank">cloud.soda.io/signup</a> to create a new Soda account, which is free for a 45-day trial.
2. They navigate to **your avatar** > **Profile**, access the **API keys** tab, then click the plus icon to generate new API keys.
3. They copy+paste the API key values to a temporary, secure place in their local environment.

## Connect Soda Cloud to Soda Library and data source

1. Within Databricks, the Data Engineer creates two notebooks:
   * **Data Ingestion Checks**, which runs scans for data quality after data is ingested into a Unity catalog
   * **Input Data Checks**, which prepares data for training a machine learning model and runs data quality scans before submitting it to the model for training
2. In the same directory as the Databricks notebooks, the Data Engineer creates a `soda_settings` directory to contain the configuration file and, later, the check YAML files that Soda needs to run scans. To connect Soda to the Unity catalog, the Data Engineer prepares a `soda_conf.yml` file which stores the data source connection details.
3. To the file, they add the data source connection configuration for the Unity catalog that contains the Human Resources data, along with the Soda Cloud API key configuration, then they save the file.
{% include code-header.html %}
```yaml
data_source employee_info:
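  # The lines below are a sketch, not the guide's actual configuration:
  # they follow Soda's documented Spark/Databricks connection parameters,
  # with placeholder values for the catalog, schema, and credentials.
  type: spark
  method: databricks
  catalog: hr_catalog          # placeholder Unity catalog name
  schema: employee_data        # placeholder schema for the HR datasets
  host: <databricks-workspace-hostname>
  http_path: <http-path-from-sql-warehouse-settings>
  token: <databricks-personal-access-token>

soda_cloud:
  host: cloud.soda.io
  api_key_id: <soda-api-key-id>
  api_key_secret: <soda-api-key-secret>
```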

A check is a test that Soda executes when it scans a dataset in your data source. The `checks.yml` file stores the checks you write using the Soda Checks Language. You can create multiple checks files to organize your data quality checks and run all, or some of them, at scan time.

In this example, the Data Engineer creates two checks files in the `soda_settings` directory in Databricks:
* `ingestion_checks.yml` to execute quality checks after data ingestion into the Unity catalog in the Data Ingestion Checks notebook
* `input_data_checks.yml` to execute quality checks after transformation, and before using the data to train their ML model in the Input Data Checks notebook.

The Data Engineer creates a checks YAML file to write checks that apply to the datasets they use to train their ML model. The Data Ingestion Checks notebook runs these checks after the data is ingested into the Unity catalog. For any checks that fail, the Data Engineer can notify upstream Data Engineers or Data Product Owners to address issues such as missing data or invalid entries.

Many of the checks that the Data Engineer prepares include [check attributes]({% link soda-cl/check-attributes.md %}) which they created in Soda Cloud; see image below. When added to checks, the Data Engineer can use the attributes to filter check results in Soda Cloud, build custom views ([Collections]({% link soda-cloud/collaborate.md %}#build-check-collections)), and stay organized as they monitor data quality in the Soda Cloud UI. Skip to [Review check results](#review-check-results) to see an example.

The Data Engineer also adds a [dataset filter]({% link soda-cl/filters.md %}#configure-dataset-filters) to the quality checks that apply to the application login data. The filter serves to partition the data against which Soda executes the checks; instead of checking for quality on the entire dataset, the filter limits the scan to the previous day's data.

ingestion_checks.yml
{% include code-header.html %}
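```yaml
# A sketch only, not the guide's actual checks file: the login_logout
# dataset, the [daily] filter, and the Ingest pipeline attribute appear
# elsewhere in the guide, but the column names and thresholds below are
# illustrative assumptions.
filter login_logout [daily]:
  where: login_date = date_sub(current_date(), 1)

checks for login_logout [daily]:
  - row_count > 0:
      name: Login data arrived for the previous day
      attributes:
        pipeline: Ingest
  - missing_count(employee_id) = 0:
      name: No missing employee IDs
      attributes:
        pipeline: Ingest
  - duplicate_count(employee_id, login_date) = 0:
      name: No duplicate login records
      attributes:
        pipeline: Ingest
```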

## Post-transformation checks

The Data Engineer also prepares a second set of SodaCL checks in a separate file to run after transformation in the Input Data Checks notebook. Curious readers can download the <a href="/assets/Data Ingestion Checks.ipynb" download>ETL notebook.ipynb</a> to review the transformations and the resulting `input_data_attrition_model` DataFrame.

Two of the checks the Data Engineer prepares involve checking groups of data. The [group evolution check]({% link soda-cl/group-evolution.md %}) validates the presence or absence of a group in a dataset, or checks for changes to groups in a dataset relative to their previous state; in this case, it confirms the presence of the `Married` group in the data and warns when any group changes. Further, the [group by check]({% link soda-cl/group-by.md %}) collects and presents check results by category; in this case, it groups the results according to `JobLevel`.

input_data_checks.yml
{% include code-header.html %}
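```yaml
# A sketch only, not the guide's actual checks file: the dataset name and
# the Married/JobLevel groupings come from the guide, but MaritalStatus,
# MonthlyIncome, and the thresholds are assumed column names and values.
checks for input_data_attrition_model [daily]:
  - group evolution:
      name: Marital status groups present
      query: |
        SELECT MaritalStatus FROM input_data_attrition_model GROUP BY MaritalStatus
      fail:
        when required group missing: [Married]
      warn:
        when groups change: any
  - group by:
      group_limit: 10
      query: |
        SELECT JobLevel, AVG(MonthlyIncome) AS avg_monthly_income
        FROM input_data_attrition_model
        GROUP BY JobLevel
      fields:
        - JobLevel
      checks:
        - avg_monthly_income > 0:
            name: Average monthly income by job level
```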

## Invoke Soda in Databricks notebooks

At the [beginning](#connect-soda-cloud-to-soda-library-and-data-source) of this exercise, the Data Engineer created two notebooks in their Databricks workflow:
* **Data Ingestion Checks** to run after data is ingested into the Unity catalog
* **Input Data Checks** to run after transformation, and before using the data to train the ML model
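
In each notebook, a cell installs Soda Library and runs a programmatic scan against the configuration and checks files created earlier. The following is a minimal sketch of what the Data Ingestion Checks cell might look like, using Soda Library's documented `Scan` API; the package name and scan definition name are assumptions, and the guide's actual cells may differ.

{% include code-header.html %}
```python
# Install Soda Library in the notebook first, for example:
# %pip install -i https://pypi.cloud.soda.io soda-spark[databricks]
from soda.scan import Scan

# Configure a scan against the employee_info data source using the
# connection details and checks files in the soda_settings directory.
scan = Scan()
scan.set_data_source_name("employee_info")
scan.set_scan_definition_name("data_ingestion_checks")  # assumed name; identifies the scan in Soda Cloud
scan.add_configuration_yaml_file(file_path="soda_settings/soda_conf.yml")
scan.add_sodacl_yaml_file("soda_settings/ingestion_checks.yml")

# Execute the scan and print the results to the notebook output.
scan.execute()
print(scan.get_logs_text())

# Optionally stop the pipeline when checks fail, so the model never
# trains on poor-quality data.
scan.assert_no_checks_fail()
```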

## Review check results in Soda Cloud

After running the notebooks, the Data Engineer accesses Soda Cloud to review the check results.

In the **Checks** page, they apply filters to narrow the results to the datasets involved in the Employee Attrition ML model, then distill the results further by displaying only those with the Pipeline attribute of `Ingest`. They save the results as a Collection labeled **Employee Attrition - Ingestion** to easily access the relevant quality results in the future.

## Review check results in a Unity dashboard

After the Data Engineer trains the model to forecast employee attrition, they devise an extra step in the process that uses the [Soda Cloud API]({% link api-docs/public-cloud-api-v1.md %}) to export all the Soda check results and dataset metadata back into the Unity catalog, then builds a dashboard to display the results.

*Coming soon:* a tutorial for building a dashboard using the Soda Cloud API!