
Commit 2afadb1

janet-can (Janet Revell) and Janet Revell authored
Revision and expansion of SodaCL documentation. (#263)
* First draft of numeric metrics
* Tiny notation.
* Further revision and reorganization.
* New and revised docs for several SodaCL checks and metrics.

Co-authored-by: Janet Revell <[email protected]>
1 parent 61ef362 commit 2afadb1

22 files changed: +871 -274 lines changed

_data/nav.yml

+11 -9
@@ -113,30 +113,32 @@
   subcategories:
     - subtitle: SodaCL overview
       page: soda-cl/soda-cl-overview.md
-    - subtitle: Soda Core overview
-      page: soda-cl/soda-core-overview.md
+    - subtitle: Metrics and checks
+      page: soda-cl/metrics-and-checks.md
+    - subtitle: Anomaly score check
+      page: soda-cl/anomaly-score.md
+    - subtitle: Dataset filters
+      page: soda-cl/dataset-filters.md
     - subtitle: Distribution checks
       page: soda-cl/distribution.md
     - subtitle: Duplicate checks
       page: soda-cl/duplicates.md
-    - subtitle: For each checks
+    - subtitle: For each
       page: soda-cl/for-each.md
     - subtitle: Freshness checks
       page: soda-cl/freshness.md
-    - subtitle: Metrics and thresholds
-      page: soda-cl/metrics-thresholds.md
     - subtitle: Missing and validity checks
       page: soda-cl/missing-validity.md
-    - subtitle: Quotes
-      page: soda-cl/quotes.md
+    - subtitle: Numeric metrics
+      page: soda-cl/numeric-metrics.md
+    - subtitle: Optional check configurations
+      page: soda-cl/optional-config.md
     - subtitle: Reference checks
       page: soda-cl/reference.md
     - subtitle: Row count checks
       page: soda-cl/row-count.md
     - subtitle: Schema checks
       page: soda-cl/schema.md
-    - subtitle: Table filters
-      page: soda-cl/table-filters.md
     - subtitle: User-defined checks
       page: soda-cl/user-defined.md

_data/navcl.yml

-35
This file was deleted.

_includes/dataset-filters.md

+21
@@ -0,0 +1,21 @@
It can be time-consuming to check exceptionally large datasets for data quality in their entirety. Instead of checking whole datasets, you can use a **dataset filter** to specify a portion of data in a dataset against which Soda Core executes a check.

1. In your checks YAML file, add a section header called `filter`, then append a dataset name and, in square brackets, the name of the filter. Refer to the example below.
2. Nested under the `filter` header, use a SQL expression to specify the portion of data in a dataset that Soda Core must check. The SQL expression in the example references two variables: `ts_start` and `ts_end`. When you run the `soda scan` command, you must include these two variables as options in the command.
```yaml
filter CUSTOMERS [daily]:
  where: TIMESTAMP '${ts_start}' <= "ts" AND "ts" < TIMESTAMP '${ts_end}'
```
3. Add a separate section for `checks for your_dataset_name [filter name]`. Any checks you nest under this header execute *only* against the portion of data that the expression in the filter section defines. Refer to the example below.
4. Write any checks you wish for the dataset and the columns in it.
```yaml
checks for CUSTOMERS [daily]:
  - row_count = 6
  - missing(cat) = 2
```
5. When you wish to execute the checks, use Soda Core to run a scan of your data source and repeat the `-v` option once for each variable in your filter expression, as in the example below.
```shell
soda scan -d snowflake_customer_data -v ts_start=2022-03-11 -v ts_end=2022-03-15 checks.yml
```

If you wish to run checks on the same dataset *without* using a filter, add a separate section for `checks for your_dataset_name` without the appended filter name. Any checks you nest under this header execute against all the data in the dataset.
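
To make the variable substitution in the filter expression concrete, the following Python sketch expands the `${ts_start}` and `${ts_end}` placeholders with the values from the scan command. It is only an illustration that uses the standard library's `string.Template`; it is not Soda Core's implementation, and the `render_filter` helper is a hypothetical name.

```python
from string import Template

# Filter expression copied from the example above.
FILTER_WHERE = "TIMESTAMP '${ts_start}' <= \"ts\" AND \"ts\" < TIMESTAMP '${ts_end}'"


def render_filter(where: str, variables: dict[str, str]) -> str:
    """Replace each ${name} placeholder with the supplied value.

    Hypothetical helper for illustration only. Raises KeyError when the
    expression references a variable that was not supplied.
    """
    return Template(where).substitute(variables)


# Values as passed on the command line in the example above.
rendered = render_filter(
    FILTER_WHERE,
    {"ts_start": "2022-03-11", "ts_end": "2022-03-15"},
)
print(rendered)
```

The rendered expression selects a half-open interval: rows with `ts` on or after `ts_start` and strictly before `ts_end`.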

_includes/foreach-config.md

+27
@@ -0,0 +1,27 @@
Add a **for each** section to your checks YAML file to specify a list of checks you wish to execute on multiple datasets.

1. Add a `for each table T` section header anywhere in your YAML file. The purpose of the `T` is only to ensure that every `for each` configuration has a unique name.
2. Nested under the section header, add two nested keys, one for `tables` and one for `checks`.
3. Nested under `tables`, add a list of datasets against which to run the checks. Refer to the example below that illustrates how to use `include` and `exclude` configurations and wildcard characters {% raw %} (%) {% endraw %}.
4. Nested under `checks`, write the checks you wish to execute against all the datasets listed under `tables`.

```yaml
for each table T:
  tables:
    # include the dataset
    - dim_customers
    # include all datasets matching the wildcard expression
    - dim_products%
    # (optional) explicitly add the word include to make the list more readable
    - include dim_employee
    # exclude a specific dataset
    - exclude fact_survey_response
    # exclude any datasets matching the wildcard expression
    - exclude prospective_%
  checks:
    - row_count > 0
```

* Soda Core dataset name matching is case-insensitive.
* If any of your checks specify column names as arguments, make sure the column exists in all datasets listed under the `tables` heading.
* To add multiple `for each` configurations in your checks YAML file, configure another `for each` section header with a different letter identifier, such as `for each table R`.
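
The include, exclude, and wildcard rules above can be sketched in Python as follows. This is a minimal illustration, not Soda Core's implementation. It assumes that `%` matches any run of characters, that every other character (including `_`) matches literally, and that an exclusion always wins over an inclusion. The `resolve_datasets` helper and the list of available dataset names are hypothetical.

```python
import re


def _to_regex(pattern: str) -> re.Pattern:
    # Translate % into "any run of characters" and escape everything else.
    # IGNORECASE mirrors the case-insensitive matching described above.
    parts = (".*" if ch == "%" else re.escape(ch) for ch in pattern)
    return re.compile("".join(parts), re.IGNORECASE)


def resolve_datasets(entries: list[str], available: list[str]) -> list[str]:
    """Return the available dataset names selected by a `tables` list."""
    includes, excludes = [], []
    for entry in entries:
        words = entry.split()
        if len(words) == 2 and words[0] == "exclude":
            excludes.append(_to_regex(words[1]))
        elif len(words) == 2 and words[0] == "include":
            includes.append(_to_regex(words[1]))
        else:
            # A bare name or pattern is an implicit include.
            includes.append(_to_regex(entry.strip()))
    return [
        name
        for name in available
        if any(p.fullmatch(name) for p in includes)
        and not any(p.fullmatch(name) for p in excludes)
    ]


# Entries copied from the example above; the available names are made up.
entries = [
    "dim_customers",
    "dim_products%",
    "include dim_employee",
    "exclude fact_survey_response",
    "exclude prospective_%",
]
available = [
    "DIM_CUSTOMERS",
    "dim_products_eu",
    "dim_employee",
    "fact_survey_response",
    "prospective_leads",
    "orders",
]
selected = resolve_datasets(entries, available)
print(selected)
```

Under these assumptions, the checks listed under `checks` would run once for each name in `selected`.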

_includes/list-symbols.md

+11
@@ -0,0 +1,11 @@
```yaml
=
<
>
<=
>=
!=
<>
between
not between
```

assets/images/check-result.png

147 KB

assets/images/check.png

55.8 KB
180 KB

assets/images/unique-check.png

106 KB

assets/images/user-defined-check.png

92.2 KB

soda-cl/anomaly-score.md

+52
@@ -0,0 +1,52 @@
---
layout: default
title: Anomaly score check
description:
parent: Soda CL
---

# Anomaly score check ![beta](/assets/images/beta.png){:height="50px" width="50px" align="top"}

The anomaly score check is powered by a machine learning algorithm that works with measurements that occur over time. The algorithm learns the patterns of your data – its trends and seasonality – to identify and flag anomalous measurements in time-series data.

If you have connected Soda Core to a Soda Cloud account, Soda Core pushes check results to your cloud account where Soda Cloud stores all the historic measurements for your checks in a metric store. SodaCL can then use these stored values to establish a baseline of normal measurements against which to evaluate future measurements to identify anomalies. Therefore, you must have a [Soda Cloud account]({% link soda-cloud/overview.md%}) to use anomaly score checks.

## Prerequisites
* a Soda Cloud account, connected to Soda Core
* Soda Core Scientific package installed


## Configuration

The following example demonstrates how to use the anomaly score for the `row_count` metric in a check. You can use any numeric metric in place of `row_count`. By default, anomaly score checks yield warning check results, not failures.

```yaml
checks for CUSTOMERS:
  - anomaly score for row_count < default
```
<br />

If you wish, you can override the default anomaly score threshold. <!--why would you want to do this? what is the .7 a portion of?--> The following check yields a warning check result if the anomaly score for `row_count` is `.7` or higher.

```yaml
checks for CUSTOMERS:
  - anomaly score for row_count < .7
```

<br />

Further, you can use `warn` and `fail` thresholds with the anomaly score. The following example demonstrates how to define the threshold for `row_count` that yields a warning, and the threshold that yields a failed check result. Note that an individual check only ever yields one check result. If your check triggers both a `warn` and a `fail`, the check result only displays the more serious, failed check result.
```yaml
checks for CUSTOMERS:
  - anomaly score for row_count:
      warn: when > .8
      fail: when > .99
```
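
The rule that a check yields only one result, with `fail` taking precedence over `warn`, can be sketched in Python as follows. This is an illustration of the rule described above, not Soda Core's implementation. The function name and the outcome labels `"pass"`, `"warn"`, and `"fail"` are assumptions; the default thresholds are the values from the example.

```python
def evaluate_anomaly_score(
    score: float,
    warn_above: float = 0.8,
    fail_above: float = 0.99,
) -> str:
    """Return a single outcome for an anomaly score.

    Hypothetical helper for illustration only. The fail threshold is tested
    first, so a score that crosses both thresholds reports only the failure.
    """
    if score > fail_above:
        return "fail"
    if score > warn_above:
        return "warn"
    return "pass"


for score in (0.5, 0.9, 0.995):
    print(score, evaluate_anomaly_score(score))
```

A score of `0.995` exceeds both thresholds, yet the sketch reports a single failed result, as the text above describes.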

## Go further

* Need help? Join the <a href="http://community.soda.io/slack" target="_blank">Soda community on Slack</a>.
<br />

---
{% include docs-footer.md %}

soda-cl/dataset-filters.md

+15
@@ -0,0 +1,15 @@
---
layout: default
title: Dataset filters
description: Instead of checking whole datasets, you can use SodaCL (Beta) dataset filters to specify a portion of data in a dataset against which Soda Core executes a check.
parent: SodaCL (Beta)
redirect_from:
- soda-cl/table-filters.html
---

# Dataset filters ![beta](/assets/images/beta.png){:height="50px" width="50px" align="top"}

{% include dataset-filters.md %}

---
{% include docs-footer.md %}

soda-cl/for-each.md

+2 -26
@@ -5,34 +5,10 @@ description: Use a SodaCL (Beta) for each check to specify a list of checks you
 parent: SodaCL (Beta)
 ---
 
-# For each checks ![beta](/assets/images/beta.png){:height="50px" width="50px" align="top"}
+# For each ![beta](/assets/images/beta.png){:height="50px" width="50px" align="top"}
 
-Use a for each check to specify a list of checks you wish to execute on a multiple tables.
 
-First, in your checks.yml file, specify the list of tables using `for each table T`. The purpose of the `T` is only to ensure that every `for each` check has a unique name. Next, write the checks you wish to execute against the tables.
-
-```yaml
-for each table T:
-  tables:
-    # include the table
-    - CUSTOMERS
-    # include all tables matching the wildcard expression
-    - new_%
-    # (optional) explicitly add the word include to make the list more readable
-    - include CUSTOMERS
-    # exclude a specific table
-    - exclude fact_survey_response
-    # exclude any tables matching the wildcard expression
-    - exclude prospective_%
-  checks:
-    - row_count > 0
-```
-
-### Notes
-
-* Soda Core resolves all table names in the scan's default data source.
-* You can use `%` as a wildcard in both data source name and table name filters.
-* Soda Core table names matching is case insensitive.
+{% include foreach-config.md %}
 
 ---
 {% include docs-footer.md %}
