From abcd615befea34ffaea4278b8f95d962b30f9c7e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Santiago=20V=C3=ADquez?= Date: Tue, 25 Mar 2025 14:47:10 -0600 Subject: [PATCH 1/6] overview draft --- overview/data-testing.md | 9 ---- overview/observability.md | 9 ---- overview/overview.md | 90 ++++++++++++++++++++++++++++++++++++++- 3 files changed, 89 insertions(+), 19 deletions(-) delete mode 100644 overview/data-testing.md delete mode 100644 overview/observability.md diff --git a/overview/data-testing.md b/overview/data-testing.md deleted file mode 100644 index adce54e9..00000000 --- a/overview/data-testing.md +++ /dev/null @@ -1,9 +0,0 @@ ---- -layout: default -title: Data testing -nav_order: 310 -description: What is data testing? -parent: Soda Overview ---- - -# What is data testing? \ No newline at end of file diff --git a/overview/observability.md b/overview/observability.md deleted file mode 100644 index e006108e..00000000 --- a/overview/observability.md +++ /dev/null @@ -1,9 +0,0 @@ ---- -layout: default -title: Observability -nav_order: 320 -description: What is observability? -parent: Soda Overview ---- - -# What is observability? \ No newline at end of file diff --git a/overview/overview.md b/overview/overview.md index 9ba99249..6138e42f 100644 --- a/overview/overview.md +++ b/overview/overview.md @@ -6,4 +6,92 @@ nav_order: 300 --- # Soda Overview -*Last modified on {% last_modified_at %}* \ No newline at end of file +*Last modified on {% last_modified_at %}* + +Soda helps data teams build reliable data products and pipelines. + +## What Soda does +You can use Soda to test data as it flows through your pipelines and monitor data quality over time. Embed tests directly in your workflows or use Soda’s built-in observability features to detect and resolve data issues early. + +Soda helps you answer key questions about your data: + +- Is the data fresh? +- Is any data missing? +- Are there duplicate records? +- Did something go wrong during a transformation? 
+- Are all values within expected ranges? +- Are data quality metrics changing over time? Are there anomalies in freshness, row counts, or missing values? + +## How Soda approaches Data Quality + +Soda follows two complementary approaches to managing data quality: data testing and data observability. Together, they help you prevent data issues and detect unexpected changes in production. + +### Data Testing +Data testing is a proactive approach to catch data quality issues before they impact downstream systems. It belongs early in your data lifecycle, during development, deployment, or transformation. + +**Use data testing to:** +- Validate data during CI/CD workflows +- Compare source and target tables for reconciliation +- Check assumptions in transformation logic +- Enforce data contracts between teams and systems + +Data tests are explicit, rule-based checks that you can define based on known expectations. + +### Data Observability +Data observability is a reactive approach to monitor data in production and catch unexpected issues as they emerge. It helps answer the question: What is happening with my data right now, and how is that changing over time? + +**Use data observability to:** +- Detect anomalies in data quality metrics such as freshness, row counts, or null values +- Monitor metric trends and seasonality +- Identify late-arriving or missing records +- Get alerted when values deviate from historical norms + +## How Soda fits into your stack + +Soda integrates with all major data platforms, including: + +- **Databases and data warehouses:** BigQuery, Snowflake, Redshift, Databricks, PostgreSQL, Spark, Dask, Presto, DuckDB and more. +- **Data catalog and metadata tools:** Alation, Atlan, Collibra, data.world, Zeenea and more. +- **Orchestration platforms:** Airflow, Azure Data Factory, Dagster, dbt, Prefect and more. +- **Cloud providers:** AWS, Google Cloud, Azure. +- **BI Tools:** Looker, Tableau, Power BI. 
+- **Messaging and Ticketing:** Jira, Opsgenie, PagerDuty, ServiceNow, Microsoft Teams and Slack. + +You can set up data quality tests programmatically using Soda Library, or configure them through the Soda Cloud user interface without writing code. Test results are pushed to Soda Cloud for monitoring, collaboration, and alerting. + +### Soda's deployment options + +You can deploy Soda in three ways, depending on your team’s scale, security needs, and infrastructure preferences. + +#### Self-operated deployment + +Install Soda Library locally and connect it to Soda Cloud using API keys. +Soda Library scans your datasets and pushes metadata to Soda Cloud. There, your team can view check results, collaborate on incidents, and integrate with tools like Slack. + +By default, your data stays within your private network. See [Data security and privacy (missing)](#) for more details. To learn more about how to set up a self-operated deployment, check out [Self-operated deployment guide (missing)](#). + +![with-library](/assets/images/with-library.png){:height="500px" width="500px"} + +#### Soda-hosted deployment + +Use Soda Cloud to connect directly to your data sources. Soda-hosted deployment gives you a secure, managed way to scan data, create no-code checks, and share insights, all from the UI. + +This option supports BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, and Snowflake. To learn more about how to set up a Soda-hosted deployment, check out [Soda-hosted deployment guide (missing)](#). + +![with-managed-agent](/assets/images/with-managed-agent.png){:height="60px" width="600px"} + +#### Self-hosted deployment + +Run Soda Library inside your own Kubernetes cluster in AWS, Google Cloud, or Azure. + +This deployment gives infrastructure teams full control over how Soda accesses data while still enabling Soda Cloud users to write and view checks. Checks can be written programmatically or through the UI. 
To learn more about how to set up a self-hosted deployment check out [Self-hosted deployment guide (missing)](#) + +![with-agent](/assets/images/with-agent.png){:height="60px" width="600px"} + + +## Where to go next? + +To get started with Soda, follow one of these quickstarts based on your needs: + +- [Data testing quickstart](#): Learn how to define and run checks in your workflows. +- [Data observability quickstart](#): Set up monitoring to detect anomalies in your datasets. \ No newline at end of file From df35900ffdb14650b5981c03dcd65cbb6cdff811 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Santiago=20V=C3=ADquez?= Date: Tue, 25 Mar 2025 18:42:13 -0600 Subject: [PATCH 2/6] add obs quickstart link --- overview/overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/overview/overview.md b/overview/overview.md index 6138e42f..6ddb084c 100644 --- a/overview/overview.md +++ b/overview/overview.md @@ -94,4 +94,4 @@ This deployment gives infrastructure teams full control over how Soda accesses d To get started with Soda, follow one of these quickstarts based on your needs: - [Data testing quickstart](#): Learn how to define and run checks in your workflows. -- [Data observability quickstart](#): Set up monitoring to detect anomalies in your datasets. \ No newline at end of file +- [Data observability quickstart]({% link observability/quickstart.md %}): Set up monitoring to detect anomalies in your datasets. 
\ No newline at end of file From 0ea9fe203938b827d5e5d3304e43d47e0ceb09a2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Santiago=20V=C3=ADquez?= Date: Tue, 25 Mar 2025 21:01:19 -0600 Subject: [PATCH 3/6] rename section --- overview/overview.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/overview/overview.md b/overview/overview.md index 6ddb084c..3aa4b5d2 100644 --- a/overview/overview.md +++ b/overview/overview.md @@ -89,7 +89,7 @@ This deployment gives infrastructure teams full control over how Soda accesses d ![with-agent](/assets/images/with-agent.png){:height="60px" width="600px"} -## Where to go next? +## What's Next? To get started with Soda, follow one of these quickstarts based on your needs: From 03044f37da257ba3c0faf8291845d6ffead61352 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Santiago=20V=C3=ADquez?= Date: Tue, 25 Mar 2025 21:01:35 -0600 Subject: [PATCH 4/6] full observability quickstart --- observability/quickstart.md | 116 ++++++++++++++++++++++++++---------- 1 file changed, 83 insertions(+), 33 deletions(-) diff --git a/observability/quickstart.md b/observability/quickstart.md index 63906061..efca9bf9 100644 --- a/observability/quickstart.md +++ b/observability/quickstart.md @@ -1,7 +1,7 @@ --- layout: default -title: Quickstart observability -description: Quickstart observability +title: Quickstart Observability +description: Quickstart Observability parent: Observability nav_order: 511 --- @@ -10,23 +10,38 @@ nav_order: 511 *Last modified on {% last_modified_at %}* -In this Quickstart, you'll: -- create a Soda Cloud account, -- connect a data source, and -- configure your first dataset to enable observability. +In this quickstart, you will: +- Create a Soda Cloud account +- Connect a data source +- Configure your first dataset to enable observability. ## Step 1: Create a Soda Cloud Account -1. Go to cloud.soda.io and create a Soda Cloud account. - If you already have a Soda account, log in. -2. 
By default, Soda prepares a Soda-hosted agent for all newly-created accounts. However, if you are an Admin in an existing Soda Cloud account and wish to use a Soda-hosted agent, navigate to **your avatar** > **Organization Settings**. In the **Organization** tab, click the checkbox to **Enable Soda-hosted Agent**. -3. Navigate to **your avatar** > **Data Sources**, then access the **Agents** tab. Notice your out-of-the-box Soda-hosted agent that is up and running.
+1. Go to cloud.soda.io and sign up for a Soda Cloud account. If you already have an account, log in. +2. By default, Soda creates a Soda-hosted Agent for all new accounts. You can think of an Agent as the bridge between your data sources and Soda Cloud. A Soda-hosted Agent runs in Soda's cloud and securely connects to your data sources to scan for data quality issues. +3. If you are an admin and prefer to deploy your own agent, you can configure a self-hosted agent: + +- In Soda Cloud, go to **your avatar** > **Agents** +- Click **New Soda Agent** and follow the setup instructions +
![soda-hosted-agent](/assets/images/soda-hosted-agent.png){:height="700px" width="700px"} +> **Soda Agent Basics** +>
+> There are two types of Soda Agents: +> 1. **Soda-hosted Agent:** This is an out-of-the-box, ready-to-use agent that Soda provides and manages for you. It's the quickest way to get started with Soda as it requires no installation or deployment. It supports connections to specific data sources like BigQuery, Databricks SQL, MS SQL Server, MySQL, PostgreSQL, Redshift, and Snowflake. [Soda-hosted agent (missing)](#) +> 2. **Self-hosted Agent:** This is a version of the agent that you deploy in your own Kubernetes cluster within your cloud environment (like AWS, Azure, or Google Cloud). It gives you more control and supports a wider range of data sources. [Self-hosted agent (missing)](#) +> +> A Soda Agent is essentially Soda Library (the core scanning technology) packaged as a containerized application that runs in Kubernetes. It acts as the bridge between your data sources and Soda Cloud, allowing users to: +> - Connect to data sources securely +> - Run scans to check data quality +> - Create and manage no-code checks directly in the Soda Cloud interface +> +> The agent only sends metadata (not your actual data) to Soda Cloud, keeping your data secure within your environment. Soda [Agent basic concepts (missing)](#) + ## Step 2: Add a Data Source -1. In your Soda Cloud account, navigate to **your avatar** > **Data Sources**. -2. Click **New Data Source**, then follow the guided steps to create a new data source (e.g., PostgreSQL, BigQuery). - Enter the required connection details (host, port, database name, credentials). - Refer to the section - **Attributes** below for insight into the values to enter in the fields and editing panels in the guided steps. +1. In Soda Cloud, go to **your avatar** > **Data Sources**. +2. Click **New Data Source**, then follow the guided steps to create the connection. 
+Use the table below to understand what each field means and how to complete it: #### Attributes @@ -39,12 +54,10 @@ In this Quickstart, you'll: | Custom Cron Expression | (Optional) Write your own cron expression to define the schedule Soda Cloud uses to run scans. | | Anomaly Dashboard Scan Schedule
![available-2025](/assets/images/available-2025.png){:height="150px" width="150px"}
| Provide the scan frequency details Soda Cloud uses to execute a daily scan to automatically detect anomalies for the anomaly dashboard. | +{:start="3"} +3. Complete the connection configuration. These settings are specific to each data source (PostgreSQL, MySQL, Snowflake, etc) and usually include connection details such as host, port, credentials, and database name. -3. Enter values in the fields to provide the connection configurations Soda Cloud needs to be able to access the data in the data source. Connection configurations are data source-specific and include values for things such as a database's host and access credentials. - -Soda hosts agents in a secure environment in Amazon AWS. As a SOC 2 Type 2 certified business, Soda responsibly manages Soda-hosted agents to ensure that they remain private, secure, and independent of all other hosted agents. See [Data security and privacy]({% link soda/data-privacy.md %}#using-a-soda-hosted-agent) for details. - -Use the following data source-specific connection configuration pages to populate the connection fields in Soda Cloud. +Use the appropriate guide below to complete the connection: * [Connect to BigQuery]({% link soda/connect-bigquery.md %}) * [Connect to Databricks SQL]({% link soda/connect-spark.md %}#connect-to-spark-for-databricks-sql) * [Connect to MS SQL Server]({% link soda/connect-mssql.md %}) @@ -53,27 +66,64 @@ Use the following data source-specific connection configuration pages to populat * [Connect to Redshift]({% link soda/connect-redshift.md %}) * [Connect to Snowflake]({% link soda/connect-snowflake.md %}) -💡 Already have data source connected to a self-hosted agent? You can [migrate]({% link soda/upgrade.md %}#migrate-a-data-source-from-a-self-hosted-to-a-soda-hosted-agent) a data source to a Soda-hosted agent. +## Step 3: Configure Dataset Discovery +Dataset discovery captures metadata about each dataset, including its schema and the data types of each column. 
+ +- In Step 3 of the guided workflow, specify the datasets you want to profile. Because dataset discovery can be resource-intensive, only include the datasets you need for observability. +See [Compute consumption and cost considerations]({% link soda-cl/profile.md %}#compute-consumption-and-cost-considerations) for more detail. -## Step 3: Select and Configure a Dataset +## Step 4: Add Column Profiling +Column profiling extracts metrics such as the mean, minimum, and maximum values in a column, and the number of missing values. -1. In the editing panel of **4. Profile**, use the include and exclude syntax to indicate the datasets for which Soda must profile and prepare an anomaly dashboard. The default syntax in the editing panel instructs Soda to profile every column of every dataset in the data source, and, superfluously, all datasets with names that begin with prod. The `%` is a wildcard character. See [Add column profiling]({% link soda-cl/profile.md %}#add-column-profiling) for more detail on profiling syntax. +- In Step 4 of the guided workflow, use include/exclude patterns to define which columns Soda should profile. Soda uses this information to power the anomaly dashboard. Learn more about [column profiling syntax]({% link soda-cl/profile.md %}#add-column-profiling). ```yaml - profile columns: - columns: - - "%.%" # Includes all your datasets - - prod% # Includes all datasets that begin with 'prod' +profile columns: + columns: + - "%.%" # Includes all columns of all datasets + - "prod%.%" # Includes all columns of all datasets that begin with 'prod' ``` -2. Continue the remaining steps to add your new data source, then **Test Connection**, if you wish, and **Save** the data source configuration. +## Step 5: Add Automated Monitoring Checks +In Step 5 of the guided workflow, define which datasets should have automated checks applied for anomaly scores and schema evolution. + +> If you are using the early access anomaly dashboard, this step is not required. 
 Soda automatically enables monitoring in the dashboard. See [Anomaly Dashboard]({% link soda-cloud/anomaly-dashboard.md %}) for details. + +Use include/exclude filters to target specific datasets. Read more about [automated monitoring configuration]({% link soda-cl/automated-monitoring.md %}). + +```yaml +automated monitoring: + datasets: + - include prod% # Includes all the datasets that begin with 'prod' + - exclude test% # Excludes all the datasets that begin with 'test' +``` + +## Step 6: Assign a Data Source and Dataset Owner +In Step 6 of the guided workflow, assign responsibility for maintaining the data source and each dataset. + +- **Data Source Owner:** Manages the connection settings and scan configurations for the data source. +- **Dataset Owner:** Becomes the default owner of each dataset for monitoring and collaboration. + +For more details, see [Roles and rights in Soda Cloud]({% link soda-cloud/roles-global.md %}). + +## Step 7: Test Connection and Save +- Click **Test Connection** to verify your configuration. +- Click **Save** to start profiling the selected datasets. + +Once saved, Soda runs a first scan using your profiling settings. This initial scan provides baseline measurements that Soda uses to begin learning patterns and identifying anomalies. + +## Step 8: View Metric Monitor Results +1. Go to the **Datasets** page in Soda Cloud. +2. Select a dataset you included in profiling. +3. Open the **Metric Monitors** tab to view automatically detected issues. + +![profile-anomalies](/assets/images/profile-anomalies.png){:height="700px" width="700px"} -3. Soda begins profiling the datasets according to your **Profile** configuration while the algorithm uses the first measurements collected from a scan of your data to begin the work of identifying patterns in the data. You can navigate to the **Dataset** page for a dataset you included in profiling. Click the **Monitors** tab to view the issues Soda automatically detected. 
+### 🎉 Congratulations! You’ve set up your dataset and enabled observability. -### Congratulations! You’ve set up your dataset and enabled observability. +## What's Next? +Now that your first dataset is configured and observability is active, try: -#### What's Next? -Now that you’ve set up your first dataset and enabled observability, try: -[Exploring detailed metrics in the dashboard.]({% link observability/anomaly-dashboard.md %}) -[Setting up notifications for anomaly detection.]({% link observability/set-up-alerts.md %}) +- [Explore detailed metrics in the anomaly dashboard]({% link observability/anomaly-dashboard.md %}) +- [Set up alerts for anomaly detection]({% link observability/set-up-alerts.md %}) From b03ab1ba84482c491b1e24a7873b1239229d170b Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Santiago=20V=C3=ADquez?= Date: Tue, 25 Mar 2025 21:54:49 -0600 Subject: [PATCH 5/6] update observability structure --- _data/nav.yml | 14 +------------- _includes/what-is-observability.md | 2 -- observability/introduction.md | 19 ------------------- observability/observability.md | 6 +++++- {overview => soda}/overview.md | 0 5 files changed, 6 insertions(+), 35 deletions(-) delete mode 100644 _includes/what-is-observability.md delete mode 100644 observability/introduction.md rename {overview => soda}/overview.md (100%) diff --git a/_data/nav.yml b/_data/nav.yml index bf6a8e95..6c4e6c97 100644 --- a/_data/nav.yml +++ b/_data/nav.yml @@ -3,12 +3,7 @@ page: index.html - title: Soda overview - page: overview/overview.md - subcategories: - - subtitle: Data testing - page: overview/data-testing.md - - subtitle: Observability - page: overview/observability.md + page: soda/overview.md - title: Data testing page: data-testing/data-testing.md @@ -54,13 +49,6 @@ subcategories: - subtitle: Quickstart page: observability/quickstart.md - - subtitle: Introduction - page: observability/introduction.md - subcategories: - - subtitle: Observability - page: observability/what-is-observability.md 
- - subtitle: Metrics monitoring - page: observability/metrics-monitoring.md - subtitle: How it works/Observability Guide page: get-started/get-started-observability.md subcategories: diff --git a/_includes/what-is-observability.md b/_includes/what-is-observability.md deleted file mode 100644 index fcb44d33..00000000 --- a/_includes/what-is-observability.md +++ /dev/null @@ -1,2 +0,0 @@ - -## What is observability? \ No newline at end of file diff --git a/observability/introduction.md b/observability/introduction.md deleted file mode 100644 index 6523e003..00000000 --- a/observability/introduction.md +++ /dev/null @@ -1,19 +0,0 @@ ---- -layout: default -title: Introduction -description: Introduction -parent: Observability -nav_order: 512 ---- - -# Introduction - -*Last modified on {% last_modified_at %}* - -{% include banner-upgrade.md %} - -{% include what-is-observability.md %} - - - - diff --git a/observability/observability.md b/observability/observability.md index eca015cd..984cf675 100644 --- a/observability/observability.md +++ b/observability/observability.md @@ -9,4 +9,8 @@ nav_order: 500 *Last modified on {% last_modified_at %}* -{% include banner-upgrade.md %} \ No newline at end of file +{% include banner-upgrade.md %} + +OBSERVABILITY INTRO GOES HERE + +## What is Observability? 
diff --git a/overview/overview.md b/soda/overview.md similarity index 100% rename from overview/overview.md rename to soda/overview.md From 8897c4a50203b92b48e2b78f4c1a18d33d799794 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Santiago=20V=C3=ADquez?= Date: Tue, 25 Mar 2025 23:10:48 -0600 Subject: [PATCH 6/6] observability intro --- observability/observability.md | 39 ++++++++++++++++++++++++++++++++-- 1 file changed, 37 insertions(+), 2 deletions(-) diff --git a/observability/observability.md b/observability/observability.md index 984cf675..1dcfe9ab 100644 --- a/observability/observability.md +++ b/observability/observability.md @@ -11,6 +11,41 @@ nav_order: 500 {% include banner-upgrade.md %} -OBSERVABILITY INTRO GOES HERE +Use observability to monitor data quality at scale across all your datasets. +Observability helps you catch unexpected issues without needing to define every rule up front. -## What is Observability? +Where data testing focuses on known expectations, observability helps you detect the unknown unknowns—like late-arriving records, schema changes, or sudden spikes in missing values. It offers broad, low-effort coverage and requires little configuration, making it easy to share data quality responsibilities across technical and non-technical teams. + +## What is data observability? + +**Data observability** is the practice of continuously monitoring your data for unexpected changes, anomalies, and structural issues. It involves collecting and analyzing metrics about your datasets to understand their health over time. 
+ +Instead of writing checks manually for each dataset, observability uses profiling and metrics to automatically detect problems such as: +- A spike in null values +- A drop in row counts +- Unusual value distributions + +**Data Observability helps you:** +- Detect incidents faster +- Scale coverage across more data +- Reduce time spent on manual testing +- Empower more team members to spot and act on issues + + +## What is metrics monitoring? + +**Metrics monitoring** is the foundation of data observability in Soda. Soda collects metrics from datasets—such as row count, null values, min/max, and value distribution—and tracks how those metrics evolve over time. + +Soda then uses built-in anomaly detection to identify when metrics deviate from expected patterns. These deviations are surfaced in the **Metric Monitors** tab for each dataset. + +You can use metric monitoring to: +- Spot problems without writing checks +- Establish baselines for normal behavior +- Alert data owners when something unusual happens +- Provide insight to business users without requiring code + +## What's Next? +To get started with Soda observability, follow one of these guides: + +- [Data observability quickstart]({% link observability/quickstart.md %}): Set up monitoring to detect anomalies in your datasets. +- [Data observability guide]({% link observability/observability-guide.md %}): Learn how to get the most out of Soda’s data observability platform. \ No newline at end of file