SnowPatrol

SnowPatrol is an application for anomaly detection and alerting of Snowflake usage powered by Machine Learning. It’s also an MLOps reference implementation, an example of how to use Airflow as a way to manage the training, testing, deployment, and monitoring of predictive models.

At Astronomer, we firmly believe in the power of open source and sharing knowledge. We are excited to share this MLOps reference implementation with the community and hope it will be useful to others.

Figure: anomalies.png (detected usage anomalies)

Project Structure

Business Objective

Astronomer is a data-driven company, and we rely heavily on Snowflake to store and analyze our data. Self-service analytics is a core part of our culture, and we encourage our teams to answer questions with data themselves, either by running SQL queries or by building their own data visualizations. As a result, a large number of users and service accounts run queries every day, more than 900k on average, and the majority of these queries come from automated Airflow DAGs running on various schedules. Identifying which automated process or which user caused increased usage on a given day is time-consuming when done manually, and we would rather not waste time chasing cars.

Cost management is also a key part of our operations. Like most organizations, we want to avoid overages and keep our Snowflake costs under control. While that is a common goal, it can be challenging to achieve: Snowflake costs are complex and can be attributed to a variety of factors.

SnowPatrol aims to identify anomalous usage activity so that it can be investigated and corrected in a timely manner.

In addition to anomalous usage activity, SnowPatrol can also track the Snowflake costs associated with every Airflow DAG and Task. Install the Astronomer SnowPatrol Plugin in your Airflow Deployments to automatically add Airflow Metadata to every Snowflake Query through Query Tags.
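
Once queries are tagged, Snowflake usage can be attributed to individual DAGs directly from the account usage views. The query below is an illustrative sketch only: the dag_id key, the JSON tag structure, and the seven-day window are assumptions and should be adjusted to match the tags your plugin version actually writes.

SELECT
    TRY_PARSE_JSON(query_tag):dag_id::STRING AS dag_id,      -- assumed tag key
    COUNT(*)                                 AS query_count,
    SUM(credits_used_cloud_services)         AS cloud_services_credits
FROM snowflake.account_usage.query_history
WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())  -- assumed lookback window
  AND query_tag <> ''
GROUP BY 1
ORDER BY cloud_services_credits DESC;

Note that QUERY_HISTORY only exposes cloud services credits per query; warehouse compute credits are metered per warehouse and hour, so a full cost picture still requires joining against the metering views.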

To understand how the ML Model was built, refer to the MODELING documentation page.

Project Setup

Prerequisites

To use SnowPatrol in your organization, you need:

Snowflake Permissions

Step 1: Create a Role named snowpatrol and grant it the USAGE_VIEWER and ORGANIZATION_BILLING_VIEWER database roles. This is needed so that SnowPatrol can query the following schemas and tables:

  • Schemas:
    • SNOWFLAKE.ORGANIZATION_USAGE
    • SNOWFLAKE.ACCOUNT_USAGE
  • Tables:
    • SNOWFLAKE.ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
    • SNOWFLAKE.ORGANIZATION_USAGE.WAREHOUSE_METERING_HISTORY
    • SNOWFLAKE.ACCOUNT_USAGE.METERING_HISTORY
    • SNOWFLAKE.ACCOUNT_USAGE.TABLE_STORAGE_METRICS
    • SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
CREATE ROLE snowpatrol COMMENT = 'This role has USAGE_VIEWER and ORGANIZATION_BILLING_VIEWER privilege';

GRANT DATABASE ROLE SNOWFLAKE.USAGE_VIEWER TO ROLE snowpatrol;
GRANT DATABASE ROLE SNOWFLAKE.ORGANIZATION_BILLING_VIEWER TO ROLE snowpatrol;

Step 2: SnowPatrol also needs access to create tables and write data in a dedicated database schema.

GRANT USAGE ON DATABASE <database> TO ROLE snowpatrol;
GRANT ALL PRIVILEGES ON SCHEMA <database>.snowpatrol TO ROLE snowpatrol;
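
Note that the GRANT statements above assume the dedicated snowpatrol schema already exists. If it does not, create it first (a minimal sketch, reusing the same <database> placeholder):

CREATE SCHEMA IF NOT EXISTS <database>.snowpatrol;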

Step 3: Grant the Role to the User or Service Account you plan to use to connect to Snowflake.

GRANT ROLE snowpatrol TO USER <user>;
GRANT ROLE snowpatrol TO USER <service_account>;
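
To double-check the setup, you can inspect the grants and, optionally, make snowpatrol the default role for the service account so that scheduled connections do not need to set it explicitly. The ALTER USER statement is an optional sketch, not something SnowPatrol requires:

SHOW GRANTS TO ROLE snowpatrol;
SHOW GRANTS TO USER <service_account>;
ALTER USER <service_account> SET DEFAULT_ROLE = snowpatrol;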

Setup

  1. Install Astronomer's Astro CLI. The Astro CLI is an Apache 2.0 licensed, open-source tool for building Airflow instances, and it provides the fastest and easiest way to get up and running with Airflow in minutes. It is Docker Compose-based and integrates easily with Weights and Biases for a local development environment.

    To install the Astro CLI, open a terminal window and run:

    For macOS

    brew install astro

    For Linux

    curl -sSL install.astronomer.io | sudo bash -s
  2. Clone this repository:

    git clone https://github.com/astronomer/snowpatrol
    cd snowpatrol
  3. Create a file called .env with the following connection strings and environment variables. To make this easier, we have included a .env.example file that you can rename to .env.

    • WANDB_API_KEY: The API key should have access to the Weights and Biases snowpatrol entity and snowpatrol project. Example:

      WANDB_API_KEY='xxxxxxxxxxx'
      
    • AIRFLOW_CONN_SNOWFLAKE_CONN: This connection string is used for extracting the usage data to the project schema. The user should have access to a role with permissions to read the SNOWFLAKE.ORGANIZATION_USAGE.WAREHOUSE_METERING_HISTORY view. Example:

      AIRFLOW_CONN_SNOWFLAKE_CONN='{"conn_type": "snowflake", "login": "<username>", "password": "<password>", "schema": "<schema>", "extra": {"account": "<account>", "warehouse": "<warehouse>", "role": "<role>", "authenticator": "snowflake", "application": "AIRFLOW"}}'
      
    • AIRFLOW_CONN_SLACK_API_ALERT: Add a Slack token for sending Slack alerts. Example:

      AIRFLOW_CONN_SLACK_API_ALERT='{"conn_type": "slack", "password": "xoxb-<>"}'
      
    • SNOWFLAKE_ACCOUNT_NUMBER: Your Snowflake account number. Example:

      SNOWFLAKE_ACCOUNT_NUMBER=<account_number>
      
    • SNOWFLAKE_DATASET_DB: The Snowflake database to use when creating the SnowPatrol tables. Example:

      SNOWFLAKE_DATASET_DB=<my_db>
      
    • SNOWFLAKE_DATASET_SCHEMA: The Snowflake schema to use when creating the SnowPatrol tables. Example:

      SNOWFLAKE_DATASET_SCHEMA=<my_schema>
      
  4. Start Apache Airflow

    astro dev start

    Once Airflow has started, a browser window should open to http://localhost:8080. Log in with the following credentials:

    • username: admin
    • password: admin
  5. Run the initial_setup DAG to create the necessary Snowflake tables.

  6. Run the data_ingestion DAG to load daily warehouse metering data.

  7. Run the data_preparation DAG. When it completes, it triggers the train_isolation_forest DAG.

  8. After the data_preparation and train_isolation_forest DAGs complete, Airflow triggers the predict_isolation_forest DAG. (If you prefer the command line to the Airflow UI, see the sketch after this list for triggering these DAGs with the Astro CLI.)

  9. Deploy to Astro: Complete the following steps to promote from Airflow running locally to a production deployment in Astro.

    • Log in to Astro from the CLI.
    astro login
    astro deployment create -n 'snowpatrol'
    astro deployment variable update -lsn 'snowpatrol'
    astro deploy -f

    The variable update will load variables from the .env file that was created in step #3.

  10. Log in to Astro and ensure that all the DAGs are unpaused. Every night, the data_ingestion, data_preparation, data_reporting, train_isolation_forest and predict_isolation_forest DAGs run. Alerts are sent to the channel specified in slack_channel in the predict_isolation_forest DAG.

  11. Configure the GitHub Integration in Astro to implement CI/CD for Apache Airflow and deploy code to Astro. This is the fastest way to deploy new changes. See the documentation for more details.
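
If you prefer the command line to the Airflow UI, the DAGs from steps 5 to 7 can also be triggered locally through the Astro CLI, which forwards Airflow CLI commands to the local environment. A minimal sketch, assuming the Airflow 2 CLI and the DAG IDs listed above:

astro dev run dags unpause initial_setup
astro dev run dags trigger initial_setup
astro dev run dags trigger data_ingestion
astro dev run dags trigger data_preparation

As described in steps 7 and 8, the data_preparation run then cascades into train_isolation_forest and predict_isolation_forest automatically.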

Feedback

Give us your feedback, comments and ideas at https://github.com/astronomer/snowpatrol/discussions
