This repository uses pytest
:
# create a venv
python3.8 -m venv venv/
# install requirements
venv/bin/pip install -r requirements.txt
# run pytest with all linters and 4 workers in parallel
venv/bin/pytest --black --docstyle --flake8 --mypy-ignore-missing-imports -n 4
# use -k to selectively run a set of tests that matches the expression `udf`
venv/bin/pytest -k udf
# run integration tests with 4 workers in parallel
gcloud auth application-default login # or set GOOGLE_APPLICATION_CREDENTIALS
export GOOGLE_PROJECT_ID="bigquery-etl-integration-test"
venv/bin/pytest -m integration -n 4
To provide authentication credentials for the Google Cloud API the GOOGLE_APPLICATION_CREDENTIALS
environment variable must be set to the file path of the JSON file that contains the service account key.
See Mozilla BigQuery API Access instructions to request credentials if you don't already have them.
Include a comment like -- Tests
followed by one or more query statements
after the UDF in the SQL file where it is defined. Each statement in a SQL file
that defines a UDF that does not define a temporary function is collected as a
test and executed independently of other tests in the file.
Each test must use the UDF and throw an error to fail. Assert functions defined
in tests/assert/
may be used to evaluate outputs. Tests must not use any
query parameters and should not reference any tables. Each test that is
expected to fail must be preceded by a comment like #xfail
, similar to a SQL
dialect prefix in the BigQuery Cloud Console.
For example:
CREATE TEMP FUNCTION udf_example(option INT64) AS (
CASE
WHEN option > 0 then TRUE
WHEN option = 0 then FALSE
ELSE ERROR("invalid option")
END
);
-- Tests
SELECT
assert_true(udf_example(1)),
assert_false(udf_example(0));
#xfail
SELECT
udf_example(-1);
#xfail
SELECT
udf_example(NULL);
- Make a directory for test resources named
tests/{dataset}/{query_name}/{test_name}/
, e.g.tests/telemetry_derived/clients_last_seen_raw_v1/test_single_day
query_name
must match a query file namedsql/{dataset}/{query_name}.sql
, e.g.sql/telemetry_derived/clients_last_seen_v1.sql
test_name
should start withtest_
, e.g.test_single_day
- Add
.yaml
files for input tables, e.g.clients_daily_v6.yaml
- Include the dataset prefix if it's set in the tested query,
e.g.
analysis.clients_last_seen_v1.yaml
- This will result in the dataset prefix being removed from the query,
e.g.
query = query.replace("analysis.clients_last_seen_v1", "clients_last_seen_v1")
- This will result in the dataset prefix being removed from the query,
e.g.
- Include the dataset prefix if it's set in the tested query,
e.g.
- Add
.sql
files for input view queries, e.g.main_summary_v4.sql
- Don't include a
CREATE ... AS
clause - Fully qualify table names as
`{project}.{dataset}.table`
- Include the dataset prefix if it's set in the tested query,
e.g.
telemetry.main_summary_v4.sql
- This will result in the dataset prefix being removed from the query,
e.g.
query = query.replace("telemetry.main_summary_v4", "main_summary_v4")
- This will result in the dataset prefix being removed from the query,
e.g.
- Don't include a
- Add
expect.yaml
to validate the resultDATE
andDATETIME
type columns in the result are coerced to strings using.isoformat()
- Columns named
generated_time
are removed from the result before comparing toexpect
because they should not be static
- Optionally add
.schema.json
files for input table schemas, e.g.clients_daily_v6.schema.json
- Optionally add
query_params.yaml
to define query parametersquery_params
must be a list
- If the destination table is also an input table then
generated_time
should be a requiredDATETIME
field to ensure minimal validation - Input table files
- All of the formats supported by
bq load
are supported yaml
andjson
format are supported and must contain an array of rows which are converted in memory tondjson
before loading- Preferred formats are
yaml
for readability orndjson
for compatiblity withbq load
- All of the formats supported by
expect.yaml
- File extensions
yaml
,json
andndjson
are supported - Preferred formats are
yaml
for readability orndjson
for compatiblity withbq load
- File extensions
- Schema files
- Setting the description of a top level field to
time_partitioning_field
will cause the table to use it for time partitioning - File extensions
yaml
,json
andndjson
are supported - Preferred formats are
yaml
for readability orjson
for compatiblity withbq load
- Setting the description of a top level field to
- Query parameters
- Scalar query params should be defined as a dict with keys
name
,type
ortype_
, andvalue
query_parameters.yaml
may be used instead ofquery_params.yaml
, but they are mutually exclusive- File extensions
yaml
,json
andndjson
are supported - Preferred format is
yaml
for readability
- Scalar query params should be defined as a dict with keys
- Install the CircleCI Local CI
- Download GCP service account keys
- Integration tests will only successfully run with service account keys
that belong to the
circleci
service account in thebiguqery-etl-integration-test
project
- Integration tests will only successfully run with service account keys
that belong to the
- Run
circleci build
and set required environment variablesGOOGLE_PROJECT_ID
andGCLOUD_SERVICE_KEY
:
gcloud_service_key=`cat /path/to/key_file.json`
# to run a specific job, e.g. integration:
circleci build --job integration \
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
--env GCLOUD_SERVICE_KEY=$gcloud_service_key
# to run all jobs
circleci build \
--env GOOGLE_PROJECT_ID=bigquery-etl-integration-test \
--env GCLOUD_SERVICE_KEY=$gcloud_service_key