Commit 4a672e6

Revise failed row sample documentation (#860)
* Revise failed row sample documentation
* WIP
* scan context for custom sampler + column parameter for fr u-d
* Bits and pieces from various completed Jiras
* Revised and updated for latest changes and improvements
* reroute internal links
* updated image + changelog
* add failed row query for SQL metrics to failed row samples page
* update configurations for block
* TOC and tweaks
* heavy revisions
* adjustments to images
* Final adjustments + address PR feedback
1 parent 31cf585 commit 4a672e6

44 files changed, +1071 -763 lines changed

_data/nav.yml

Lines changed: 2 additions & 2 deletions
@@ -86,8 +86,8 @@
     subcategories:
     - subtitle: Run a scan and view results
       page: soda-library/run-a-scan.md
-    - subtitle: Examine failed rows
-      page: soda-cloud/failed-rows.md
+    - subtitle: Manage failed rows samples
+      page: soda-cl/failed-row-samples.md
     - subtitle: Manage scheduled scans
       page: soda-cloud/scan-mgmt.md
     - subtitle: Configure orchestrated scans

_includes/about-soda.md

Lines changed: 1 addition & 1 deletion
@@ -23,7 +23,7 @@ To test your data quality, you choose a flavor of Soda (choose a deployment mode
 * A fourth state, warn, is something you can explicitly configure for individual checks.
 * **Review scan results and investigate issues.** <br />You can review the scan output in the command-line and in your Soda Cloud account. Access visualized scan results, set alert notifications, track trends in data quality over time, and integrate with the messaging, ticketing, and data cataloging tools you already use, like Slack, Jira, and Atlan.

-<sup>1</sup> An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can [disable]({% link soda-cl/failed-rows-checks.md %}#disable-failed-rows-sampling-for-specific-columns).
+<sup>1</sup> An exception to this rule is when Soda collects failed row samples that it presents in scan output to aid with issue investigation, a feature you can [disable]({% link soda-cl/failed-row-samples.md %}#disable-all-failed-row-samples).

 Access a [Soda product overview]({% link soda/product-overview.md %}).<br />
 Learn more about [How Soda works]({% link soda-library/how-library-works.md %}).<br />

_includes/custom-sampler.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 * You can save Soda Library scan results anywhere in your system; the `scan_result` object contains all the scan result information. To import Soda Library in Python so you can utilize the `Scan()` object, [install a Soda Library package]({% link soda-library/programmatic.md %}), then use `from soda.scan import Scan`.
 * If you provide a name for the scan definition to identify inline checks in a programmatic scan as independent of other inline checks in a different programmatic scan or pipeline, be sure to set a unique scan definition name for each programmatic scan. Using the same scan definition name in multiple programmatic scans results in confused check results in Soda Cloud.
-* If you wish to collect samples of failed rows when a check fails, you can employ a custom sampler; see [Configure a failed row sampler]({% link soda-cl/failed-rows-checks.md %}#configure-a-failed-row-sampler).
+* If you wish to collect samples of failed rows when a check fails, you can employ a custom sampler; see [Configure a failed row sampler]({% link soda-cl/failed-row-samples.md %}#configure-a-python-custom-sampler).
 * Be sure to include any variables in your programmatic scan *before* the check YAML files. Soda requires the variable input for any variables defined in the check YAML files.

_includes/disable-all-samples.md

Lines changed: 2 additions & 2 deletions
@@ -1,9 +1,9 @@
 To prevent Soda Cloud from receiving any sample data or failed row samples for any datasets in any data sources to which you have connected your Soda Cloud account, proceed as follows:

 1. As an Admin, log in to your Soda Cloud account and navigate to **your avatar** > **Organization Settings**.
-2. In the **Organization** tab, check the box to "Disable collecting samples and failed rows for metrics in Soda Cloud", then **Save**.
+2. In the **Organization** tab, uncheck the box to **Allow Soda to collect sample data and failed row samples for all datasets**, then **Save**.

-Alternatively, if you use Soda Library, you can adjust the configuration in your `configuration.yml` to disable all samples, as in the following example.
+Alternatively, if you use Soda Library, you can adjust the configuration in your `configuration.yml` to disable all samples for an individual data source, as in the following example.
 {% include code-header.html %}
 ```yaml
 data_source my_datasource:

_includes/reroute-failed-rows.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+If the data you are checking contains sensitive information, you may wish to send any failed rows samples that Soda collects to a secure, internal location rather than Soda Cloud. These configurations apply to checks defined as no-code checks, in an agreement, or in a checks YAML file.
+
+To do so, you have two options:
+1. **HTTP sampler**: Create a function, such as a lambda function, available at a specific URL within your environment that Soda can invoke for every check result in a data source that fails and includes failed row samples. Use the function to perform any necessary parsing from JSON to your desired format (CSV, Parquet, etc.) and store the failed row samples in a location of your choice.
+2. **Python CustomSampler**: If you run programmatic Soda scans of your data, add a custom sampler to your Python script to collect samples of rows with a `fail` check result. Once collected, you can print the failed row samples in the CLI, for example, or save them to an alternate destination.

_includes/samples-limit-datasource.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+If you wish to set a limit on the samples that Soda implicitly collects for an entire data source, you can do so by adjusting the configuration YAML file, or editing the **Data Source** connection details in Soda Cloud, as per the following syntax. This configuration also applies to checks defined as no-code checks.
+{% include code-header.html %}
+```yaml
+data_source soda_test:
+  type: postgres
+  host: xyz.xya.com
+  port: 5432
+  ...
+  sampler:
+    samples_limit: 50
+```

_release-notes/manage-failed-rows.md

Lines changed: 2 additions & 2 deletions
@@ -6,5 +6,5 @@ products:
 ---

 Use these new ways of managing exposure to sensitive data such as personally identifiable information (PII), when collecting failed row samples.
-* Disable failed rows samples for [specific columns]({% link soda-cl/failed-rows-checks.md %}#disable-failed-rows-sampling-for-specific-columns) to effectively “turn off” failed row collection for specific columns in datasets.
-* Reroute any failed rows samples that Soda collects to a secure, internal location rather than Soda Cloud. To do so, add the storage configuration to your [sampler configuration]({% link soda-cl/failed-rows-checks.md %}#reroute-failed-rows-samples) to specify the columns you wish to exclude.
+* Disable failed rows samples for specific columns to effectively “turn off” failed row collection for specific columns in datasets.
+* Reroute any failed rows samples that Soda collects to a secure, internal location rather than Soda Cloud. To do so, add the storage configuration to your sampler configuration to specify the columns you wish to exclude.
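
The rerouting option mentioned here, in its HTTP sampler form, needs a function that parses JSON into a format such as CSV before storing it. The following is a minimal sketch of that parsing step only, using the Python standard library. The payload shape (a `rows` list of column-to-value objects) and the name `failed_rows_to_csv` are assumptions made for illustration; they are not taken from Soda's documentation, so check the actual payload Soda sends before adapting this.

```python
import csv
import io
import json


def failed_rows_to_csv(payload_json: str) -> str:
    """Convert a JSON payload of failed rows into CSV text.

    Assumed (hypothetical) payload shape:
    {"check_name": "...", "rows": [{"column": value, ...}, ...]}
    """
    payload = json.loads(payload_json)
    rows = payload.get("rows", [])
    if not rows:
        return ""

    # Collect column names in first-seen order so that rows with
    # differing keys still share one consistent header.
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing keys and None values become empty cells
    return buffer.getvalue()
```

A function deployed behind a URL could call `failed_rows_to_csv` on the request body and write the result to the internal storage location of your choice.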
(image, file name not shown) 431 KB
(image, file name not shown) 45.7 KB

assets/images/dataset-settings.png (107 KB)

assets/images/disable-all.png (267 KB)

assets/images/failed-row-samples.png (-191 KB, binary file not shown)

assets/images/failed-rows-query.png (395 KB)

assets/images/failed-rows.png (-19.3 KB)

assets/images/file-storage.png (299 KB)

(image, file name not shown) 152 KB

soda-cl/check-attributes.md

Lines changed: 13 additions & 9 deletions
@@ -65,7 +65,7 @@ Note that you can only define or edit check attributes as an [Admin]({% link sod

 ## Apply an attribute to one or more checks

-While only a Soda Cloud Admin can define or revise check attributes, any Author user can apply attributes to new or existing checks when:
+While only a Soda Cloud Admin can define or revise check attributes, any user with permission to define or change checks on a dataset can apply attributes to new or existing checks when:
 * writing or editing checks in an agreement in Soda Cloud
 * creating or editing no-code checks in Soda Cloud
 * writing or editing checks in a checks YAML file for Soda Library
@@ -83,18 +83,22 @@ checks for dim_product:
       best_before: 2022-02-20
 ```

-Optionally, you can add attributes to *all* the checks in a single `checks for dataset_name` block. Using the following example configuration, Soda applies the check attributes to the `duplicate_count`, `missing_percent` and `anomaly_score` checks.
+Optionally, you can add attributes to *all* the checks for a dataset. Using the following example configuration, Soda applies the check attributes to the `duplicate_count` and `missing_percent` checks for the `dim_product` dataset. Note that if you specify a different attribute value for an individual check than is defined in the `configurations for` block, Soda obeys the individual check's attribute instructions.

 ```yaml
-checks for dim_customer:
-  - attributes:
-      department: Marketing
-      priority: 1
-  - duplicate_count(last_name) < 10
-  - missing_percent(phone) = 0
-  - anomaly_score for row_count < default
+configurations for dim_product:
+  attributes:
+    department: [Marketing]
+    priority: [1]
+
+
+checks for dim_product:
+  - duplicate_count(product_line) = 0
+  - missing_percent(standard_cost) < 3%
 ```

+<br />
+
 During a scan, Soda validates the attribute's input – **NAME** (the key in the key:value pair), **Type**, **Allowed Values** – to ensure that the key:value pairs match the expected input. If the input is unexpected, Soda evaluates no checks, and the scan results in an error. For example, if your attribute's type is Number and the check author enters a value of `one` instead of `1`, the scan produces an error to indicate the incorrect attribute value.

 The following table outlines the expected values for each type of attribute.
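
The two behaviors this hunk describes (an individual check's attribute value taking precedence over the `configurations for` block, and a Number attribute rejecting a value such as `one`) can be illustrated with a small sketch. This is not Soda's implementation; `merge_attributes` and `validate_number_attribute` are hypothetical names, and the plain-dictionary representation of attributes is an assumption made for the example.

```python
def merge_attributes(dataset_level: dict, check_level: dict) -> dict:
    """Return the attributes for one check; check-level values win on conflict."""
    merged = dict(dataset_level)  # copy so the dataset-level defaults stay untouched
    merged.update(check_level)    # values set on the individual check override them
    return merged


def validate_number_attribute(name: str, value: object) -> float:
    """Raise ValueError unless the value is numeric (hypothetical stand-in for scan-time validation)."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"attribute {name!r} expects a Number, got {value!r}")
    return float(value)
```

With dataset-level defaults of `{"department": "Marketing", "priority": 1}`, a check that sets `{"priority": 2}` ends up with priority 2 and keeps the Marketing department, while passing `"one"` to `validate_number_attribute` raises an error in the same spirit as the scan error described above.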

soda-cl/compare.md

Lines changed: 1 addition & 1 deletion
@@ -185,7 +185,7 @@ Refer to [Custom check templates]({% link soda-cl/custom-check-examples.md %}#co

 ## Go further
 * Need help? Join the <a href="https://community.soda.io/slack" target="_blank"> Soda community on Slack</a>.
-* Read more about [Failed rows]({% link soda-cloud/failed-rows.md %}) in Soda Cloud.
+* Read more about [Failed row samples]({% link soda-cl/failed-row-samples.md %}) in Soda Cloud.
 * Learn more about [SodaCL metrics and checks]({% link soda-cl/metrics-and-checks.md %}) in general.
 <br />


soda-cl/custom-check-examples.md

Lines changed: 3 additions & 3 deletions
@@ -9,7 +9,7 @@ redirect_from: /soda-cl/check-templates.html
 # Custom check examples
 *Last modified on {% last_modified_at %}*

-Out of the box, Soda Checks Language (SodaCL) makes several built-in metrics and checks, such as `row_count`, available for you to use to define checks for data quality. If the built-in metrics that Soda offers do not quite cover some of your more specific or complex needs, you can use [user-defined]({% link soda-cl/user-defined.md %}) and [failed rows]({% link soda-cl/failed-rows-checks.md %}) checks.
+Out of the box, Soda Checks Language (SodaCL) makes several built-in metrics and checks, such as `row_count`, available for you to use to define checks for data quality. If the built-in metrics that Soda offers do not quite cover some of your more specific or complex needs, you can use [user-defined]({% link soda-cl/user-defined.md %}) and [failed rows checks]({% link soda-cl/failed-rows-checks.md %}).

 **User-defined checks** and **failed rows checks** enable you to define your own metrics that you can use in a SodaCL check. You can also use these checks to simply define SQL queries or Common Table Expressions (CTE) that Soda executes during a scan, which is what most of these examples do.

@@ -77,7 +77,7 @@ checks for dim_product:

 ## Find duplicates in a dataset without a unique ID column

-You can use the built-in [duplicate_count metric]({% link soda-cl/numeric-metrics.md %}) to check the contents of a column for duplicate values and Soda automatically sends any failed rows -- that is, rows containing duplicate values -- to Soda Cloud for you to [examine]({% link soda-cloud/failed-rows.md %}).
+You can use the built-in [duplicate_count metric]({% link soda-cl/numeric-metrics.md %}) to check the contents of a column for duplicate values and Soda automatically sends any failed rows -- that is, rows containing duplicate values -- to Soda Cloud for you to [examine]({% link soda-cl/failed-row-samples.md %}).

 However, if your dataset does not contain a unique ID column, as with a denormalized dataset or a dataset produced from several joins, you may need to define uniqueness using a combination of columns. This example uses a failed rows check with SQL queries to go beyond a simple, single-column check. Replace the values in the double curly braces {%raw %} {{ }} {% endraw %} with your own relevant values.

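
The idea of defining uniqueness over a combination of columns can be tried outside Soda with a short, self-contained sketch. It runs a `GROUP BY ... HAVING COUNT(*) > 1` query against an in-memory SQLite table; the `orders` table, its columns, and the helper `find_composite_duplicates` are hypothetical and exist only for this illustration, not in the documentation being changed.

```python
import sqlite3


def find_composite_duplicates(conn, table, key_columns):
    """Return (key values..., occurrences) for each key combination seen more than once."""
    # Identifiers are interpolated directly, so pass only trusted table and column names.
    cols = ", ".join(key_columns)
    query = (
        f"SELECT {cols}, COUNT(*) AS occurrences "
        f"FROM {table} GROUP BY {cols} "
        f"HAVING COUNT(*) > 1 ORDER BY {cols}"
    )
    return conn.execute(query).fetchall()


# Hypothetical sample data: no single column is unique on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("ann", "2024-01-01", 10.0),
        ("ann", "2024-01-01", 10.0),  # same customer and date: a composite duplicate
        ("ann", "2024-01-02", 5.0),
        ("bob", "2024-01-01", 7.5),
    ],
)
duplicates = find_composite_duplicates(conn, "orders", ["customer", "order_date"])
```

The same grouping query, with your own table and column names, is the kind of SQL a failed rows check could carry.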
@@ -359,7 +359,7 @@ checks for exchange_operations:
 ## Go further
 * Need help? Join the <a href="https://community.soda.io/slack" target="_blank"> Soda community on Slack</a>.
 * Learn more about [Comparing data using SodaCL]({% link soda-cl/compare.md %}).
-* Read more about [Failed rows]({% link soda-cloud/failed-rows.md %}) in Soda Cloud.
+* Read more about [Failed row samples]({% link soda-cl/failed-row-samples.md %}) in Soda Cloud.
 <br />

 ---
