
Commit 40dcc87

Authored Nov 29, 2023
Updates to best practices for recon checks and reference checks (#672)
1 parent ed72b63 · commit 40dcc87

8 files changed: +41 -3 lines changed
 

_includes/reference-with-spark.md

+6
@@ -0,0 +1,6 @@
+If you are using reference checks with a Spark or Databricks data source to validate the existence of values in two datasets within the same schema, you must first convert your DataFrames into temp views to add them to the Spark session, as in the following example.
+```python
+# after adding your Spark session to the scan
+df.createOrReplaceTempView("df")
+df2.createOrReplaceTempView("df2")
+```
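
For context, the temp views slot into a Soda Library scan roughly as follows. This is a minimal sketch: the DataFrame contents, the `spark_df` data source name, and the `id` reference check are illustrative assumptions, not part of the commit.

```python
from pyspark.sql import SparkSession
from soda.scan import Scan

spark = SparkSession.builder.appName("soda_reference_demo").getOrCreate()

# Two illustrative DataFrames to cross-reference
df = spark.createDataFrame([(1,), (2,)], ["id"])
df2 = spark.createDataFrame([(1,), (2,), (3,)], ["id"])

# Register both DataFrames as temp views so Soda can resolve
# them as datasets within the same Spark session
df.createOrReplaceTempView("df")
df2.createOrReplaceTempView("df2")

scan = Scan()
scan.set_data_source_name("spark_df")
scan.add_spark_session(spark, data_source_name="spark_df")
scan.add_sodacl_yaml_str(
    """
checks for df:
  - values in (id) must exist in df2 (id)
"""
)
scan.execute()
print(scan.get_logs_text())
```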

soda-cl/recon.md

+2
@@ -154,6 +154,8 @@ To efficiently use resources at scan time, best practice dictates that you first
 
 Depending on the volume of data on which you must perform reconciliation checks, metric recon checks run considerably faster and use far fewer resources. Start by defining metric reconciliation checks that test grouping, filters, and joins to get meaningful insight into whether your ingestion or transformation works as expected. Where these checks do not surface all the details you need, or do not provide enough confidence in the output, proceed with record reconciliation checks.
 
+When running record reconciliation checks, you may also consider executing scans in batches. Internal experimentation continues, but early results indicate that processing about 5 MB of data per batch for unordered checks, and about 10 MB per batch for ordered checks, yields the fastest processing times.
+
 Read more about [Limitations and constraints](#limitations-and-constraints).
 
soda-cl/reference.md

+5 -1
@@ -39,13 +39,17 @@ checks for dim_department_group:
   - values in (department_group_name) must exist in dim_employee (department_name)
 ```
 
-You can also validate that data in one dataset does not exist in another.
+You can also validate that data in one dataset does *not* exist in another.
 {% include code-header.html %}
 ```yaml
 checks for dim_customer_staging:
   - values in (birthdate) must not exist in dim_customer_prod (birthdate)
 ```
 
+### Reference checks and DataFrames
+
+{% include reference-with-spark.md %}
+
 ### Failed row samples
 
 Reference checks automatically collect samples of any failed rows to display in Soda Cloud. The default number of failed row samples that Soda collects and displays is 100.
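
If 100 samples is more than you need, a `samples limit` configuration on the check itself can reduce the volume collected. A sketch, assuming the standard SodaCL samples-limit syntax applies to reference checks as it does to other checks that collect failed rows:

```yaml
checks for dim_customer_staging:
  - values in (birthdate) must not exist in dim_customer_prod (birthdate):
      samples limit: 20
```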

soda-cl/troubleshoot.md

+8
@@ -19,6 +19,7 @@ parent: SodaCL reference
 [Failed row check with CTE error](#failed-row-check-with-cte-error)<br />
 [Errors with column names containing periods or colons](#errors-when-column-names-containing-periods-or-colons)<br />
 [Errors when using in-check filters](#errors-when-using-in-check-filters)<br />
+[Using reference checks with Spark DataFrames](#using-reference-checks-with-spark-dataframes)
 <br />
 
 ## Errors with valid format
@@ -220,6 +221,13 @@ checks for my_dataset:
       filter: |
         "Status" = 'Client'
 ```
+
+## Using reference checks with Spark DataFrames
+
+{% include reference-with-spark.md %}
+
+
+
 ## Go further
 
 * Need help? Join the <a href="https://community.soda.io/slack" target="_blank"> Soda community on Slack</a>.

soda/connect-spark.md

+8
@@ -79,6 +79,10 @@ scan.set_scan_definition_name('YOUR_SCHEDULE_NAME')
 
 <br />
 
+{% include reference-with-spark.md %}
+
+<br />
+
 
 ## Connect Soda Library for SparkDF to Soda Cloud
 
@@ -167,6 +171,10 @@ print(scan.get_logs_text())
 
 <br />
 
+{% include reference-with-spark.md %}
+
+<br />
+
 ## Connect to Spark for Hive
 
 In addition to `soda-spark-df`, install and configure the `soda-spark[hive]` package if you use Apache Hive.

soda/integrate-github.md

+1 -1
@@ -50,7 +50,7 @@ jobs:
 
 For example, in a repository in which you are adding a transformation or making changes to a dbt model, you can add the **GitHub Action for Soda** to your workflow, as above. With each new pull request, or commit to an existing one, it executes a Soda scan for data quality and presents the results of the scan in a comment on the pull request, and in a report in Soda Cloud.
 
-Where the scan results indicate an issue with data quality, Soda notifies you in both a PR comment and by email so that you can investigate and address any issues before merging your PR into production.
+Where the scan results indicate an issue with data quality, Soda notifies you both in a PR comment and by email so that you can investigate and address any issues before merging your PR into production. Note that the Action does not yet support sending notifications via Slack, only email; see [Notes and limitations](#notes-and-limitations).
 
 ![github-comment](/assets/images/github-comment.png){:height="500px" width="500px"}
 
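
For readers without the Action set up, a minimal hand-rolled workflow that does roughly what the Action automates might look like the sketch below. The `configuration.yml` and `checks.yml` file names, the data source name, and the `soda-postgres` package are assumptions; the official Action's inputs differ, so consult its README for the real interface.

```yaml
name: Soda scan
on: pull_request

jobs:
  soda-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      # Install the Soda Library package for your data source
      - run: pip install -i https://pypi.cloud.soda.io soda-postgres
      # Run the scan; a non-zero exit code fails the job when checks fail
      - run: soda scan -d my_datasource -c configuration.yml checks.yml
```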

soda/new-documentation.md

+5
@@ -9,6 +9,11 @@ parent: Learning resources
 
 <br />
 
+#### November 28, 2023
+* Added advice on batch processing to [Best practice for using reconciliation checks]({% link soda-cl/recon.md %}#best-practice-for-using-reconciliation-checks).
+* Added instructions for using reference checks with DataFrames; see [Use Soda Library with Spark DataFrames on Databricks]({% link soda/connect-spark.md %}#use-soda-library-with-spark-dataframes-on-databricks).
+* Added a list of [compatible data sources]({% link soda/quick-start-end-user.md %}#connect-a-data-source) and a link to data source configuration reference content to the Self-serve Soda use case guide.
+
 #### November 24, 2023
 * Added [release notes]({% link release-notes/all.md %}) documentation for Soda Library 1.1.24 - 1.1.25.
 * Added a known issue to [Group By]({% link soda-cl/group-by.md %}) configuration: it does not support anomaly score checks.

soda/quick-start-end-user.md

+6 -1
@@ -52,12 +52,17 @@ Access the [exhaustive deployment instructions]({% link soda-agent/deploy.md %}#
 
 ## Connect a data source
 
+The Soda Agent supports connections with the following data sources.
+{% include compatible-cloud-datasources.md %}
+
+<br />
+
 1. Log in to your Soda Cloud account, then navigate to **your avatar** > **Data Sources**.
 2. In the **Agents** tab, confirm that you can see the Soda Agent you deployed and that its status is "green" in the **Last Seen** column. If not, refer to the Soda Agent documentation to [troubleshoot]({% link soda-agent/deploy.md %}#troubleshoot-deployment) its status.
 ![agent-running](/assets/images/agent-running.png){:height="700px" width="700px"}
 3. Navigate to the **Data source** tab, then click **New Data Source** and follow the [guided steps]({% link soda-agent/deploy.md %}#add-a-new-data-source) to:
    * identify the new data source and its default scan schedule
-   * provide connection configuration details for the data source, and test the connection to the data source
+   * provide [connection configuration]({% link soda/connect-athena.md %}) details for the data source, and test the connection to the data source (see the sketch after this diff)
    * profile the datasets in the data source to gather basic metadata about the contents of each
    * identify the datasets to which you wish to apply automated monitoring for anomalies and schema changes
    * assign ownership roles for the data source and its datasets
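
As an illustration of the connection configuration step, here is a hedged sketch of the YAML you supply for an Athena data source. The data source name, credentials placeholders, region, and staging bucket are assumptions; check the linked configuration reference for the exact keys your data source requires.

```yaml
data_source my_athena:
  type: athena
  access_key_id: ${AWS_ACCESS_KEY_ID}
  secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  region_name: eu-west-1
  staging_dir: s3://my-bucket/soda-staging
  schema: public
```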
