---
title: "Creating alerts"
sidebar_position: 1
---

Set up alerts to receive notifications when failed events occur in your data pipeline.

## Before you start

- Access to the Data Quality Dashboard

## Create an alert

1. Navigate to **Data Quality** in the left sidebar
2. Click **Manage alerts** in the top-right corner
3. Click **Create alert**

![Create alert form](images/dq_create_alert.png)

### Configure destination

Choose how you want to receive notifications:

#### Email notifications

1. Select **Email** as destination
2. Enter alert name (e.g., "mobile-app")
3. Add recipient email addresses
4. Click **Add filters** to configure triggers

![Email destination configuration](images/dq_create_email_alert.png)

#### Slack notifications

1. Select **Slack** as destination
2. Enter alert name (e.g., "web-app")
3. Select Slack channel from dropdown
4. Click **Add filters** to configure triggers

![Slack destination configuration](images/dq_create_slack_alert.png)

If there is no active Slack integration, a **Connect with Slack** button will appear instead of the list of channels.

![Connect to Slack](images/dq_connect_slack.png)

A Slack consent screen will appear.

![Slack consent](images/dq_slack.png)

Once a Slack alert is configured, you will see a confirmation notification in the selected Slack channel.

![Slack confirmation](images/dq_slack_confirmation.png)

### Set up filters

Configure when alerts should trigger:

1. **Issue types**: Select ValidationError, ResolutionError, or both
2. **Data structures**: Choose specific data structures (the alert applies to all versions)
3. **App IDs**: Filter by application identifiers

![Filter configuration](images/dq_filters.png)
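
The filter logic above can be sketched in Python. This is an illustrative model only (the field names and empty-set-means-match-all convention are assumptions, not the actual implementation): a failed event triggers the alert when it passes every configured filter dimension.

```python
# Hypothetical sketch of how the three filter dimensions combine.
# Field names and semantics are illustrative, not the real implementation.

def event_matches_filters(event, issue_types, app_ids):
    """Return True if a failed event passes all configured filters.

    An empty filter set is treated as "no restriction" for that dimension.
    """
    if issue_types and event["issue_type"] not in issue_types:
        return False
    if app_ids and event["app_id"] not in app_ids:
        return False
    return True

event = {"issue_type": "ValidationError", "app_id": "mobile"}
print(event_matches_filters(event, {"ValidationError"}, {"mobile"}))  # True
print(event_matches_filters(event, {"ResolutionError"}, set()))      # False
```

Leaving a dimension unconfigured widens the alert; combining all three narrows it to exactly the failures you care about.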

### Complete setup

1. Review your configuration
2. Click **Confirm** to create the alert
3. Your alert will appear in the alerts list

## Alert frequency

Alerts are checked every 10 minutes. You'll receive notifications when new failed events match your filter criteria.

## Next steps

- [Manage existing alerts](/docs/data-product-studio/data-quality/alerts/managing-alerts/index.md)
- [Explore failed events](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/index.md)
41 changes: 41 additions & 0 deletions docs/data-product-studio/data-quality/alerts/index.md
---
title: "New Failed events alerts"
sidebar_position: 4
---

:::info Legacy monitoring
For legacy failed events monitoring, see [monitoring failed events](/docs/data-product-studio/data-quality/failed-events/monitoring-failed-events/index.md).
:::

Failed events alerts automatically notify you when [failed events](/docs/fundamentals/failed-events/index.md) occur in your data pipeline. Set up alerts to receive notifications via email or Slack when validation errors, resolution errors, or other data quality issues arise.

## How alerts work

The alerting system monitors your failed events and sends notifications based on the filters you configure. Alerts are checked every 10 minutes and sent to your specified destinations when matching failed events are detected.
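
The evaluation cycle described above can be sketched as follows. This is a simplified model under stated assumptions (the `Alert` structure, dictionary-shaped events, and grouping logic are all hypothetical), showing only the idea: each pass considers the failed events that are new since the last check and collects matches per alert.

```python
from dataclasses import dataclass, field

# Illustrative sketch of one alert evaluation pass (run every 10 minutes).
# All names and structures here are assumptions, not the actual system.

@dataclass
class Alert:
    name: str
    issue_types: set = field(default_factory=set)  # empty = match all
    app_ids: set = field(default_factory=set)      # empty = match all

def matches(alert, event):
    """True if a failed event satisfies the alert's filters."""
    if alert.issue_types and event["issue_type"] not in alert.issue_types:
        return False
    if alert.app_ids and event["app_id"] not in alert.app_ids:
        return False
    return True

def check_alerts(alerts, new_failed_events):
    """Group new failed events by the alerts they trigger."""
    notifications = {}
    for alert in alerts:
        matched = [e for e in new_failed_events if matches(alert, e)]
        if matched:
            notifications[alert.name] = matched  # would be sent to email/Slack
    return notifications

alerts = [Alert("mobile-app", issue_types={"ValidationError"}, app_ids={"mobile"})]
events = [{"issue_type": "ValidationError", "app_id": "mobile"},
          {"issue_type": "ResolutionError", "app_id": "web"}]
print(check_alerts(alerts, events))  # only the first event matches "mobile-app"
```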

## Alert destinations

- **Email**: Send notifications to one or more email addresses
- **Slack**: Send notifications to specific Slack channels

## What you can filter on

Configure alerts to trigger only for specific types of failed events:

- **Issue types**: ValidationError, ResolutionError
- **Data structures**: Filter by specific schemas or event types
- **App IDs**: Filter by application identifiers

## Getting started

1. Navigate to the Data Quality Dashboard
2. View your failed events overview
3. Click **Manage alerts** to set up notifications
4. Create and configure your first alert

![Data Quality Dashboard overview](images/dq_manage_alerts_button.png)

## Next steps

- [Create your first alert](/docs/data-product-studio/data-quality/alerts/creating-alerts/index.md)
- [Manage existing alerts](/docs/data-product-studio/data-quality/alerts/managing-alerts/index.md)
---
title: "Managing alerts"
sidebar_position: 2
---

Edit, delete, or review existing failed events alerts.

## View alerts

1. Navigate to **Data Quality** in the left sidebar
2. Click **Manage alerts** in the top-right corner
3. View all configured alerts with their destinations

![Manage alerts interface](images/dq_list_alerts.png)

## Edit an alert

1. Click the arrow next to the alert name
2. Modify destination, filters, or recipients
3. Click **Save** to update

## Delete an alert

1. Click the arrow next to the alert name
2. Click the three-dot menu button
3. Click **Delete**
4. Confirm deletion

## Multiple notifications

Alerts trigger when new failed events match your filters. If you receive multiple notifications, check if:
- Failed events are occurring frequently
- Filter criteria are too broad
- Multiple alerts have overlapping configurations
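
The last point can be made concrete with a small sketch (the filter representation and overlap rule are illustrative assumptions): two alerts can both fire for the same failed event when every filter dimension either overlaps or is left unrestricted in at least one of them.

```python
# Hypothetical helper: detect when two alerts could fire for the same event.
# An empty set means the dimension is unrestricted (matches everything).

def dimension_overlaps(a, b):
    """Two filter sets overlap if either is unrestricted or they share a value."""
    return not a or not b or bool(a & b)

def alerts_overlap(filters_a, filters_b):
    return all(
        dimension_overlaps(filters_a[k], filters_b[k])
        for k in ("issue_types", "app_ids")
    )

mobile = {"issue_types": {"ValidationError"}, "app_ids": {"mobile"}}
catch_all = {"issue_types": set(), "app_ids": set()}  # no filters configured

print(alerts_overlap(mobile, catch_all))  # True: both fire for mobile ValidationErrors
```

Narrowing each alert so the dimensions no longer intersect removes the duplicate notifications.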

## Next steps

- [Create additional alerts](/docs/data-product-studio/data-quality/alerts/creating-alerts/index.md)
- [Explore failed events](/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/index.md)
---
title: "Managing data quality"
date: "2020-02-15"
sidebar_position: 3
sidebar_label: "Legacy Failed events"
---

[Failed events](/docs/fundamentals/failed-events/index.md) are events the pipeline had some problem processing (for example, events that did not pass validation).
---
title: "Troubleshooting data quality dashboard"
sidebar_position: 4
sidebar_custom_props:
offerings:
- bdp
sidebar_label: "Troubleshooting"
---

This guide helps you troubleshoot common errors when using the data quality dashboard with your warehouse.

## Missing warehouse permissions

When deploying a loader with the data quality add-on (API), you may encounter permission errors that prevent the dashboard from querying your warehouse.

### BigQuery: missing `bigquery.jobs.create` permission {#bigquery-permissions}

#### Error code range
- `21xxx`

#### Error description
`Missing permission 'bigquery.jobs.create' on Bigquery...`

#### Root cause
- The service account lacks the required permission to create BigQuery jobs
- This permission can be granted via the `roles/bigquery.jobUser` role

#### How to diagnose
Check if your service account has the required role:

```bash
gcloud projects get-iam-policy <PROJECT_ID> \
--flatten="bindings[].members" \
--filter="bindings.members:<SERVICE_ACCOUNT_EMAIL>" \
--format="table(bindings.role)"
```

#### Fix
Grant the required role to your service account (recommended):

```bash
gcloud projects add-iam-policy-binding <PROJECT_ID> \
--member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
--role="roles/bigquery.jobUser"
```

Alternatively, if you need more granular control, create a custom role with only the `bigquery.jobs.create` permission:

```bash
gcloud iam roles create customBigQueryJobCreator \
--project=<PROJECT_ID> \
--title="BigQuery Job Creator" \
--description="Create BigQuery jobs for Data Quality Dashboard" \
--permissions="bigquery.jobs.create"

gcloud projects add-iam-policy-binding <PROJECT_ID> \
--member="serviceAccount:<SERVICE_ACCOUNT_EMAIL>" \
--role="projects/<PROJECT_ID>/roles/customBigQueryJobCreator"
```

#### Helpful links
- [BigQuery IAM roles documentation](https://cloud.google.com/bigquery/docs/access-control#bigquery)
- [Service account permissions](https://cloud.google.com/iam/docs/service-accounts)

### Snowflake: missing `USAGE` privilege {#snowflake-permissions}

#### Error code range
- `11xxx`

#### Error description
`Missing required privileges on Snowflake: No active warehouse selected in the current session...`

#### Root cause
- The role lacks `USAGE` privilege on the active warehouse
- Without this privilege, queries cannot be executed

#### How to diagnose
Verify current warehouse privileges for your role:

```sql
SHOW GRANTS ON WAREHOUSE <WAREHOUSE_NAME>;
SHOW GRANTS TO ROLE <ROLE_NAME>;
```

#### Fix
Grant the `USAGE` privilege on the warehouse:

```sql
GRANT USAGE ON WAREHOUSE <WAREHOUSE_NAME> TO ROLE <ROLE_NAME>;
```

Ensure the grant is properly applied:

```sql
-- Verify the grant
SHOW GRANTS ON WAREHOUSE <WAREHOUSE_NAME>;
```

#### Helpful links
- [Snowflake warehouse privileges](https://docs.snowflake.com/en/user-guide/security-access-control-privileges#warehouse-privileges)
- [Snowflake access control](https://docs.snowflake.com/en/user-guide/security-access-control-overview)

## Query timeouts

Long-running queries or resource pool exhaustion can cause the data quality dashboard to time out when fetching failed events.

### BigQuery and Snowflake query timeouts {#query-timeouts}

#### Error code range
- `12xxx`
- `22xxx`

#### Error description
`Query exceeded timeout` or `Query execution time limit exceeded`

#### Root cause
- Large volume of failed events requiring extensive scanning
- Warehouse resource pool exhaustion
- Concurrent query limits reached

#### How to diagnose

**BigQuery:**
```sql
-- Check recent query performance
SELECT
job_id,
user_email,
total_slot_ms,
total_bytes_processed,
TIMESTAMP_DIFF(end_time, start_time, SECOND) as duration_seconds
FROM `<PROJECT_ID>.region-us.INFORMATION_SCHEMA.JOBS_BY_PROJECT`
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND state = 'DONE'
AND statement_type = 'SELECT'
ORDER BY total_slot_ms DESC
LIMIT 10;
```

**Snowflake:**
```sql
-- Check query history
SELECT
query_id,
query_text,
warehouse_name,
execution_time,
queued_overload_time,
bytes_scanned
FROM table(information_schema.query_history())
WHERE start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
AND execution_status = 'SUCCESS'
ORDER BY execution_time DESC
LIMIT 10;
```

#### Fix

**Reduce query scope:**
- If using the API, use smaller time windows (e.g., "Last hour" or "Last day" instead of "Last 30 days")
- If using the Console, you can switch to legacy failed events, which are based on telemetry data
- Query specific error types or schemas when investigating issues

**Optimize warehouse performance:**
- Review your warehouse configuration and query patterns
- Consider implementing partitioning, clustering, or other optimization strategies
- Monitor resource usage, and adjust warehouse size as needed

For detailed optimization guidance, refer to your warehouse documentation:
- [BigQuery query optimization best practices](https://cloud.google.com/bigquery/docs/best-practices-performance-overview)
- [Snowflake query performance optimization](https://docs.snowflake.com/en/user-guide/performance-query-optimization)

#### Helpful links
- [BigQuery query optimization](https://cloud.google.com/bigquery/docs/best-practices-performance-overview)
- [Snowflake query performance](https://docs.snowflake.com/en/user-guide/performance-query-optimization)
- [BigQuery clustering and partitioning](https://cloud.google.com/bigquery/docs/clustered-tables)
- [Snowflake clustering keys](https://docs.snowflake.com/en/user-guide/tables-clustering-keys)

## Additional considerations

### API behavior

- **Missing permissions**: Returns HTTP 400 with remediation instructions displayed in the UI

### Prevention tips

1. **Regular maintenance**:
- Monitor table sizes and query performance
- Review and optimize clustering/partitioning strategies

2. **Proactive monitoring**:
- Monitor query execution times
- Track failed events volume trends

3. **Access control**:
- Document required permissions for all team members
- Use least-privilege principles
- Regularly audit access permissions