
Commit 40dcc87

Authored Nov 29, 2023
Updates to best practices for recon checks and reference checks (#672)
1 parent ed72b63 · commit 40dcc87

8 files changed: +41 -3 lines changed
 

_includes/reference-with-spark.md

+6
@@ -0,0 +1,6 @@
+If you are using reference checks with a Spark or Databricks data source to validate the existence of values in two datasets within the same schema, you must first convert your DataFrames into temp views to add them to the Spark session, as in the following example.
+```python
+# after adding your Spark session to the scan
+df.createOrReplaceTempView("df")
+df2.createOrReplaceTempView("df2")
+```
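
For context, the temp views slot into a Soda Library scan roughly as follows. This is a minimal sketch: the DataFrame contents, the `spark_df` data source name, and the `id` reference check are illustrative assumptions, not part of the commit.

```python
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("soda_reference_demo").getOrCreate()

# Two illustrative DataFrames to cross-reference
df = spark.createDataFrame([(1,), (2,)], ["id"])
df2 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# Register both DataFrames as temp views so Soda can resolve
# them as datasets within the same Spark session
df.createOrReplaceTempView("df")
df2.createOrReplaceTempView("df2")

scan = Scan()
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")
scan.add_sodacl_yaml_str(
    """
checks for df:
  - values in (id) must exist in df2 (id)
"""
)
scan.execute()
print(scan.get_logs_text())
```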

soda-cl/recon.md

+2
@@ -154,6 +154,8 @@ To efficiently use resources at scan time, best practice dictates that you first
 
 Depending on the volume of data on which you must perform reconciliation checks, metric recon checks run considerably faster and use far fewer resources. Start by defining metric reconciliation checks that test grouping, filters, and joins to get meaningful insight into whether your ingestion or transformation works as expected. Where these checks do not surface all the details you need, or do not provide enough confidence in the output, proceed with record reconciliation checks.
 
+When running record reconciliation checks, you may also consider executing scans in batches. Internal experimentation continues, but early results indicate that processing about 5 MB of data per batch for unordered checks, and about 10 MB per batch for ordered checks, yields the fastest processing times.
+
 Read more about [Limitations and constraints](#limitations-and-constraints).
 
soda-cl/reference.md

+5 -1
@@ -39,13 +39,17 @@ checks for dim_department_group:
   - values in (department_group_name) must exist in dim_employee (department_name)
 ```
 
-You can also validate that data in one dataset does not exist in another.
+You can also validate that data in one dataset does *not* exist in another.
 {% include code-header.html %}
 ```yaml
 checks for dim_customer_staging:
   - values in (birthdate) must not exist in dim_customer_prod (birthdate)
 ```
 
+### Reference checks and DataFrames
+
+{% include reference-with-spark.md %}
+
 ### Failed row samples
 
 Reference checks automatically collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.
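
If 100 samples is more than you need, a `samples limit` configuration on the check itself can reduce the volume collected. A sketch, assuming the standard SodaCL samples-limit syntax applies to reference checks as it does to other checks that collect failed rows:

```yaml
checks for dim_customer_staging:
  - values in (birthdate) must not exist in dim_customer_prod (birthdate):
      samples limit: 20
```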

soda-cl/troubleshoot.md

+8
@@ -19,6 +19,7 @@ parent: SodaCL reference
 [Failed row check with CTE error](#failed-row-check-with-cte-error)<br />
 [Errors with column names containing periods or colons](#errors-when-column-names-containing-periods-or-colons)<br />
 [Errors when using in-check filters](#errors-when-using-in-check-filters)<br />
+[Using reference checks with Spark DataFrames](#using-reference-checks-with-spark-dataframes)
 <br />
 
 ## Errors with valid format
@@ -220,6 +221,13 @@ checks for my_dataset:
       filter: |
         "Status" = 'Client'
 ```
+
+## Using reference checks with Spark DataFrames
+
+{% include reference-with-spark.md %}
+
+
+
 ## Go further
 
 * Need help? Join the <a href="https://community.soda.io/slack" target="_blank"> Soda community on Slack</a>.

soda/connect-spark.md

+8
@@ -79,6 +79,10 @@ scan.set_scan_definition_name('YOUR_SCHEDULE_NAME')
 
 <br />
 
+{% include reference-with-spark.md %}
+
+<br />
+
 
 ## Connect Soda Library for SparkDF to Soda Cloud
 
@@ -167,6 +171,10 @@ print(scan.get_logs_text())
 
 <br />
 
+{% include reference-with-spark.md %}
+
+<br />
+
 ## Connect to Spark for Hive
 
 In addition to `soda-spark-df`, install and configure the `soda-spark[hive]` package if you use Apache Hive.

soda/integrate-github.md

+1 -1
@@ -50,7 +50,7 @@ jobs:
 
 For example, in a repository in which you are adding a transformation or making changes to a dbt model, you can add the **GitHub Action for Soda** to your workflow, as above. With each new pull request, or commit to an existing one, it executes a Soda scan for data quality and presents the results of the scan in a comment on the pull request, and in a report in Soda Cloud.
 
-Where the scan results indicate an issue with data quality, Soda notifies you in both a PR comment and by email so that you can investigate and address any issues before merging your PR into production.
+Where the scan results indicate an issue with data quality, Soda notifies you both in a PR comment and by email so that you can investigate and address any issues before merging your PR into production. Note that the Action does not yet support sending notifications via Slack, only email; see [Notes and limitations](#notes-and-limitations).
 
 ![github-comment](/assets/images/github-comment.png){:height="500px" width="500px"}
 
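
For readers without the Action set up, a minimal hand-rolled workflow that does roughly what the Action automates might look like the sketch below. The `configuration.yml` and `checks.yml` file names, the data source name, and the `soda-postgres` package are assumptions; the official Action's inputs differ, so consult its README for the real interface.

```yaml
name: Soda scan
on: pull_request

jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      # Install the Soda Library package for your data source
      - run: pip install -i https://pypi.cloud.soda.io soda-postgres
      # Run the scan; a non-zero exit code fails the job when checks fail
      - run: soda scan -d my_datasource -c configuration.yml checks.yml
```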

soda/new-documentation.md

+5
@@ -9,6 +9,11 @@ parent: Learning resources
 
 <br />
 
+#### November 28, 2023
+* Added advice on batch processing to [Best practice for using reconciliation checks]({% link soda-cl/recon.md %}#best-practice-for-using-reconciliation-checks).
+* Added instructions for using reference checks with DataFrames; see [Use Soda Library with Spark DataFrames on Databricks]({% link soda/connect-spark.md %}#use-soda-library-with-spark-dataframes-on-databricks).
+* Added a list of [compatible data sources]({% link soda/quick-start-end-user.md %}#connect-a-data-source) and a link to data source configuration reference content to the Self-serve Soda use case guide.
+
 #### November 24, 2023
 * Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.1.24 - 1.1.25.
 * Added a known issue to [Group By]({% link soda-cl/group-by.md %}) configuration: it does not support anomaly score checks.

soda/quick-start-end-user.md

+6 -1
@@ -52,12 +52,17 @@ Access the [exhaustive deployment instructions]({% link soda-agent/deploy.md %}#
 
 ## Connect a data source
 
+The Soda Agent supports connections with the following data sources.
+{% include compatible-cloud-datasources.md %}
+
+<br />
+
 1. Log in to your Soda Cloud account, then navigate to **your avatar** > **Data Sources**.
 2. In the **Agents** tab, confirm that you can see the Soda Agent you deployed and that its status is "green" in the **Last Seen** column. If not, refer to the Soda Agent documentation to [troubleshoot]({% link soda-agent/deploy.md %}#troubleshoot-deployment) its status.
 ![agent-running](/assets/images/agent-running.png){:height="700px" width="700px"}
 3. Navigate to the **Data source** tab, then click **New Data Source** and follow the [guided steps]({% link soda-agent/deploy.md %}#add-a-new-data-source) to:
    * identify the new data source and its default scan schedule
-   * provide connection configuration details for the data source, and test the connection to the data source
+   * provide [connection configuration]({% link soda/connect-athena.md %}) details for the data source, and test the connection to the data source (see the sketch after this diff)
    * profile the datasets in the data source to gather basic metadata about the contents of each
    * identify the datasets to which you wish to apply automated monitoring for anomalies and schema changes
    * assign ownership roles for the data source and its datasets
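
As an illustration of the connection configuration step, here is a hedged sketch of the YAML you supply for an Athena data source. The data source name, credentials placeholders, region, and staging bucket are assumptions; check the linked configuration reference for the exact keys your data source requires.

```yaml
data_source my_athena:
  type: athena
  access_key_id: ${AWS_ACCESS_KEY_ID}
  secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: eu-west-1
  staging_dir: s3://my-bucket/soda-staging
  schema: public
```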
