@@ -1,14 +1,15 @@
---
name: connect-to-data-source
description: 'Create and troubleshoot AWS Glue connections to JDBC databases (Oracle,
SQL Server, PostgreSQL, MySQL, RDS), Redshift, Snowflake, and BigQuery. Gathers
connection hints from user, discovers existing connections and RDS/Redshift candidates,
registers credentials in Secrets Manager or IAM DB auth, configures VPC, and tests.
Triggers on: connect to database, set up Glue connection, register data source,
connect to Snowflake/BigQuery/RDS, connection timeout, test connection, troubleshoot
connection. Do NOT use for moving data (use ingest-into-data-lake), creating tables
(use create-data-lake-table), queries (use query-data-lake), catalog exploration
(use exploring-data-catalog), or SaaS (Salesforce, ServiceNow, SAP, MongoDB, Kafka).'
name: connecting-to-data-source
description: >-
Create and troubleshoot AWS Glue connections to JDBC databases (Oracle, SQL Server,
PostgreSQL, MySQL, RDS), Redshift, Snowflake, and BigQuery. Gathers connection hints
from user, discovers existing connections and RDS/Redshift candidates, registers
credentials in Secrets Manager or IAM DB auth, configures VPC, and tests. Triggers
on: connect to database, set up Glue connection, register data source, connect to
Snowflake/BigQuery/RDS, connection timeout, test connection, troubleshoot connection.
Do NOT use for moving data (use ingesting-into-data-lake), creating tables (use
creating-data-lake-table), queries (use querying-data-lake), catalog exploration
(use exploring-data-catalog), or SaaS (Salesforce, ServiceNow, SAP, MongoDB, Kafka).
version: 1
metadata:
service: [glue, secretsmanager, rds, redshift]
@@ -20,7 +21,7 @@ argument-hint: '[source-type|connection-name|hostname]'

# Connect to Data Source

Register an external data source with AWS Glue so downstream skills (ingest-into-data-lake) can move data from it. A Glue connection stores the network config, driver, and credential reference for one source. Create once per source, reuse across jobs.
Register an external data source with AWS Glue so downstream skills (ingesting-into-data-lake) can move data from it. A Glue connection stores the network config, driver, and credential reference for one source. Create once per source, reuse across jobs.

## Philosophy

@@ -48,7 +49,7 @@ Ask the user which source type they want to connect to, or infer from hints:
| "Snowflake" | Snowflake | `SNOWFLAKE` | [snowflake-setup.md](references/snowflake-setup.md) |
| "BigQuery", "Google analytics warehouse" | BigQuery | `BIGQUERY` | [bigquery-setup.md](references/bigquery-setup.md) |

If the user names DynamoDB or a local file, stop and tell them: DynamoDB is read directly by Glue without a connection, and local files belong in the ingest-into-data-lake skill's local-upload workflow.
If the user names DynamoDB or a local file, stop and tell them: DynamoDB is read directly by Glue without a connection, and local files belong in the ingesting-into-data-lake skill's local-upload workflow.

### 3. Gather Connection Hints from the User

@@ -126,7 +127,7 @@ After TestConnection passes, verify the connection works with the user's intende

Phase B catches issues that TestConnection misses: driver compatibility at job runtime, catalog configuration, Spark-level serialization, and engine-specific auth flows (e.g., Snowflake SNOWFLAKE type works in ETL but not via JDBC crawlers).

On success in both phases, tell user the connection name is ready for `ingest-into-data-lake`. On failure in either phase, Step 8.
On success in both phases, tell user the connection name is ready for `ingesting-into-data-lake`. On failure in either phase, Step 8.

### 8. Troubleshoot (only if test failed)

@@ -1,21 +1,21 @@
---
name: creating-data-lake-tables
description: >
Create managed Iceberg tables using Amazon S3 Tables (s3tables API namespace)
with automatic compaction and snapshot management. Sets up table bucket,
namespace, table, schema, Glue catalog registration, partitioning, IAM access control.
Triggers on: create table, data lake table, analytics table, structured data storage,
S3 Tables, Iceberg, Athena table, partitioning strategy, access permissions. Do NOT use
for: importing files (use ingest-into-data-lake), vector storage (use store-and-query-vectors),
querying existing tables (use query-data-lake), or locating existing table
(use find-data-lake-assets).
argument-hint: "[table-description|schema-spec]"
name: creating-data-lake-table
description: >-
Create managed Iceberg tables using Amazon S3 Tables (s3tables API namespace) with
automatic compaction and snapshot management. Sets up table bucket, namespace, table,
schema, Glue catalog registration, partitioning, IAM access control. Triggers on:
create table, data lake table, analytics table, structured data storage, S3 Tables,
Iceberg, Athena table, partitioning strategy, access permissions. Do NOT use for:
importing files (use ingesting-into-data-lake), vector storage (use storing-and-querying-vectors),
querying existing tables (use querying-data-lake), or locating existing table (use
finding-data-lake-assets).
version: 1
metadata:
service: [s3tables, glue, athena]
task: [deploy, debug]
persona: [developer, data-engineer]
workload: [data-analytics]
argument-hint: '[table-description|schema-spec]'
---

# Create Data Lake Tables with Amazon S3 Tables
@@ -36,15 +36,15 @@ You MUST run `aws glue get-tables --database-name <NAME>` when user mentions a d

| What you find | Action |
|---------------|--------|
| Fuzzy database name ("our analytics db") | You MUST STOP. Delegate to `find-data-lake-assets` to resolve. |
| Non-S3-Tables table with matching name | You MUST STOP. Delegate to `find-data-lake-assets`. You MUST NOT create until user confirms. |
| Fuzzy database name ("our analytics db") | You MUST STOP. Delegate to `finding-data-lake-assets` to resolve. |
| Non-S3-Tables table with matching name | You MUST STOP. Delegate to `finding-data-lake-assets`. You MUST NOT create until user confirms. |
| Existing S3 Tables table with matching name | You MUST check schema match. Reuse if compatible, recreate only if user confirms. |
| No matching tables | Proceed with creation (Steps 1-8). |
| User explicitly requests new S3 Tables table | Skip checks, proceed with creation. |
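
The decision table above can be read as a small routing function. The following is only an illustrative sketch, not part of the skill: `CatalogFindings`, `pre_creation_action`, and the returned action strings are hypothetical names, and the order in which the rows are checked (explicit request first, because that row says to skip checks) is an assumption the table itself does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogFindings:
    """Hypothetical summary of what the pre-creation checks found."""
    explicit_new_table: bool = False   # user explicitly asked for a new S3 Tables table
    fuzzy_database_name: bool = False  # e.g. "our analytics db"
    match: Optional[str] = None        # None, "s3tables", or "other" (non-S3-Tables)


def pre_creation_action(findings: CatalogFindings) -> str:
    # Assumption: an explicit request wins, since that row says "Skip checks".
    if findings.explicit_new_table:
        return "proceed"
    # Fuzzy names and non-S3-Tables matches both stop and delegate.
    if findings.fuzzy_database_name or findings.match == "other":
        return "stop:delegate:finding-data-lake-assets"
    # A matching S3 Tables table needs a schema check before reuse or recreate.
    if findings.match == "s3tables":
        return "check-schema"
    # No matching tables: continue with creation.
    return "proceed"
```

For example, `pre_creation_action(CatalogFindings(match="other"))` returns the stop-and-delegate action, while a default `CatalogFindings()` returns `"proceed"`.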

**Creation paths:**

- **Existing data in S3**: Create empty table (Steps 1-8), then use `ingest-into-data-lake` skill.
- **Existing data in S3**: Create empty table (Steps 1-8), then use `ingesting-into-data-lake` skill.
- **Glue ETL pipeline**: Read `references/table-creation-glue-etl.md` first, then Steps 1-6.
- **Lake Formation access control**: Search AWS docs for `"S3 Tables integration with Lake Formation"`.

@@ -59,7 +59,7 @@ You MUST run `aws glue get-tables --database-name <NAME>` when user mentions a d

- **Explicit schema**: Validate Iceberg types.
- **Loose description**: Ask columns, types, grain. Propose and confirm.
- **Existing S3 data**: Infer schema from file headers only. Create empty table first, then use `ingest-into-data-lake` skill.
- **Existing S3 data**: Infer schema from file headers only. Create empty table first, then use `ingesting-into-data-lake` skill.

**Constraints:**

@@ -195,4 +195,4 @@ You MUST verify with `aws s3tables get-table` and confirm queryability with `DES
- [best-practices.md](references/best-practices.md) -- Iceberg types, partitions, naming, common errors
- [athena-ddl-path.md](references/athena-ddl-path.md) -- Athena DDL, schema evolution
- [table-creation-glue-etl.md](references/table-creation-glue-etl.md) -- Spark DDL via Glue ETL
- Loading data: `ingest-into-data-lake` skill
- Loading data: `ingesting-into-data-lake` skill
@@ -1,10 +1,11 @@
---
name: exploring-data-catalog
description: 'Full inventory and audit of AWS Glue Data Catalog assets across S3 Tables,
Redshift-federated, and remote Iceberg catalogs. Triggers on: inventory the catalog,
audit databases, list all tables, catalog overview, data landscape, enumerate catalogs,
data inventory, search the catalog. Do NOT use for finding specific data (use find-data-lake-assets),
running queries (use query-data-lake), or creating tables (use create-data-lake-table).'
description: >-
Full inventory and audit of AWS Glue Data Catalog assets across S3 Tables, Redshift-federated,
and remote Iceberg catalogs. Triggers on: inventory the catalog, audit databases,
list all tables, catalog overview, data landscape, enumerate catalogs, data inventory,
search the catalog. Do NOT use for finding specific data (use finding-data-lake-assets),
running queries (use querying-data-lake), or creating tables (use creating-data-lake-table).
version: 1
metadata:
service: [glue, s3, s3tables]
@@ -115,7 +116,7 @@ Resolve the argument in this order; stop at the first match:
- Flag stale tables and missing descriptions
- Suggest partitioning for large unpartitioned tables
- Summary first, details on request
- You MUST NOT execute Athena queries (`start-query-execution`) during discovery; query execution belongs to `query-data-lake`
- You MUST NOT execute Athena queries (`start-query-execution`) during discovery; query execution belongs to `querying-data-lake`

## Troubleshooting

@@ -1,11 +1,12 @@
---
name: find-data-lake-assets
description: 'Resolve data lake and lakehouse asset references across Glue Data Catalog,
S3, S3 Tables, and Redshift. Triggers on: find the table, where is our data, which
table has, locate dataset, find data for, search catalog, what tables match, Redshift
name: finding-data-lake-assets
description: >-
Resolve data lake and lakehouse asset references across Glue Data Catalog, S3, S3
Tables, and Redshift. Triggers on: find the table, where is our data, which table
has, locate dataset, find data for, search catalog, what tables match, Redshift
table, lakehouse table, data lake table, warehouse table, reverse lookup S3 path.
Do NOT use for: full catalog audits (use exploring-data-catalog), running queries
(use query-data-lake), creating tables (use create-data-lake-table).'
(use querying-data-lake), creating tables (use creating-data-lake-table).
version: 1
metadata:
service: [glue, s3, s3tables, redshift]
@@ -1,17 +1,18 @@
---
name: ingest-into-data-lake
description: 'Import data into the AWS data lake from S3 files, local uploads, JDBC
databases (Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora), Amazon Redshift,
Snowflake, BigQuery, DynamoDB, or existing Glue catalog tables (migration). Default
target is S3 Tables; standard Iceberg on a general purpose bucket is supported where
S3 Tables is not adopted. Handles one-time loads, recurring pipelines, migrations.
name: ingesting-into-data-lake
description: >-
Import data into the AWS data lake from S3 files, local uploads, JDBC databases
(Oracle, SQL Server, PostgreSQL, MySQL, RDS, Aurora), Amazon Redshift, Snowflake,
BigQuery, DynamoDB, or existing Glue catalog tables (migration). Default target
is S3 Tables; standard Iceberg on a general purpose bucket is supported where S3
Tables is not adopted. Handles one-time loads, recurring pipelines, migrations.
Triggers on: import data, load data, ingest, sync database, migrate table, move
data to AWS, set up pipeline, ETL, pull from Snowflake, query BigQuery into S3,
export DynamoDB, CTAS, convert to Iceberg. Do NOT use for setting up or troubleshooting
Glue connections (use connect-to-data-source), creating empty tables (use create-data-lake-table),
running queries (use query-data-lake), finding tables by fuzzy name (use find-data-lake-assets),
Glue connections (use connecting-to-data-source), creating empty tables (use creating-data-lake-table),
running queries (use querying-data-lake), finding tables by fuzzy name (use finding-data-lake-assets),
catalog audit (use exploring-data-catalog), or SaaS platforms like Salesforce, ServiceNow,
SAP, MongoDB, Kafka.'
SAP, MongoDB, Kafka.
version: 1
metadata:
service: [glue, s3, s3tables, athena, dynamodb]
@@ -23,7 +24,7 @@ argument-hint: '[source-path|connection-name|table-name] [--target s3-tables|ice

# Ingest into Data Lake

Move data from a source into a queryable table in the data lake. This skill assumes the source connection (if one is needed) already exists. For Glue connection setup or troubleshooting, delegate to `connect-to-data-source`.
Move data from a source into a queryable table in the data lake. This skill assumes the source connection (if one is needed) already exists. For Glue connection setup or troubleshooting, delegate to `connecting-to-data-source`.

## Philosophy

@@ -39,7 +40,7 @@ You MUST execute commands using AWS MCP server tools when connected -- they prov

- You MUST check whether AWS MCP tools or AWS CLI are available and inform the user if missing
- You MUST confirm target AWS region and verify credentials with `aws sts get-caller-identity`
- For SageMaker Unified Studio project roles, note that target tables and connections may be scoped to the project. See the caller ARN detection pattern in `query-data-lake`.
- For SageMaker Unified Studio project roles, note that target tables and connections may be scoped to the project. See the caller ARN detection pattern in `querying-data-lake`.

### 2. Classify the Source

@@ -55,7 +56,7 @@ You MUST execute commands using AWS MCP server tools when connected -- they prov

If the user names Salesforce, ServiceNow, SAP, MongoDB, Kafka, or another SaaS/streaming source, decline -- these are not supported in this release.

If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to `find-data-lake-assets` to resolve before proceeding.
If the source table is referenced by a fuzzy or business name ("migrate our orders table", "pull from the sales warehouse"), delegate to `finding-data-lake-assets` to resolve before proceeding.

### 3. Confirm Connection Exists (if applicable)

@@ -65,7 +66,7 @@ For JDBC, Snowflake, and BigQuery sources, a Glue connection is required. Check:
aws glue get-connection --name <CONNECTION_NAME> --region <REGION>
```

If the connection does not exist, stop and delegate to `connect-to-data-source` to create and test it. Do not proceed with ingest until the connection is verified.
If the connection does not exist, stop and delegate to `connecting-to-data-source` to create and test it. Do not proceed with ingest until the connection is verified.
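
The existence check above could also be done programmatically. The sketch below assumes a boto3 Glue client is passed in (for example `boto3.client("glue", region_name=...)`). `connection_exists` and `next_step` are hypothetical helper names, and the returned strings are illustrative only. It covers existence only; it does not confirm that the connection has been tested.

```python
def connection_exists(glue_client, name):
    """Return True if Glue has a connection with this name, False if it reports none."""
    try:
        glue_client.get_connection(Name=name)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        # Glue raises EntityNotFoundException when the connection is missing.
        return False


def next_step(glue_client, name):
    # Mirror the rule above: a missing connection means hand off before any ingest work.
    if connection_exists(glue_client, name):
        return "proceed-with-ingest"
    return "delegate:connecting-to-data-source"
```

Other errors (access denied, throttling) are deliberately left to propagate, so a permissions problem is not mistaken for a missing connection.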

Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connection.

@@ -74,7 +75,7 @@ Local files, S3 files, DynamoDB, and catalog migration do not need a Glue connec
You MUST ask the user (or suggest based on catalog inventory) before creating or writing to any table:

- **Database/namespace**: Does a specific target database exist? Or should one be created?
- **Table**: Existing table (append/merge) or new table (delegate to `create-data-lake-table`)?
- **Table**: Existing table (append/merge) or new table (delegate to `creating-data-lake-table`)?
- **Format**: S3 Tables (default), standard Iceberg, or raw Parquet?

**Inventory-aware defaults:**
@@ -89,8 +90,8 @@ Do not force S3 Tables on customers who haven't adopted it. See [iceberg-catalog

**Delegations from this step:**

- Target table doesn't exist -> `create-data-lake-table`
- Target database named by fuzzy term -> `find-data-lake-assets`
- Target table doesn't exist -> `creating-data-lake-table`
- Target database named by fuzzy term -> `finding-data-lake-assets`
- User doesn't know what exists -> `exploring-data-catalog`

### 5. Execute Source Workflow
@@ -132,7 +133,7 @@ For recurring pipelines, create a Glue Trigger with a cron schedule. See [testin
- `overwritePartitions()` only replaces partitions present in the DataFrame -- for full refresh with deletes, use `createOrReplace()`
- Standard Iceberg targets MUST include a LOCATION clause; S3 Tables MUST NOT
- DynamoDB does not need a Glue connection -- do not attempt to create one
- Connection failures during ingest delegate back to `connect-to-data-source`; do not debug network/credentials in this skill
- Connection failures during ingest delegate back to `connecting-to-data-source`; do not debug network/credentials in this skill
- For target tables in SageMaker Unified Studio projects, ensure the project role has write access to the target namespace before the Glue job runs

## Troubleshooting
@@ -142,7 +143,7 @@ For recurring pipelines, create a Glue Trigger with a cron schedule. See [testin
| Access Denied on S3 | Missing IAM permissions | Check Glue role has s3:GetObject, s3:PutObject |
| Access Denied on S3 Tables | Missing s3tables:* permissions | Add S3 Tables inline policy to Glue role |
| CTAS timeout | Dataset too large for Athena | Switch to Glue ETL or batch with WHERE filters |
| JDBC connection timeout/auth failure | Connection-level issue | Delegate to `connect-to-data-source` |
| JDBC connection timeout/auth failure | Connection-level issue | Delegate to `connecting-to-data-source` |
| Throughput exceeded (DynamoDB) | Read percent too high | Lower `read.percent` or use native export |

See [error-handling.md](references/error-handling.md) for the full catalog.
@@ -171,7 +172,7 @@ See [error-handling.md](references/error-handling.md) for the full catalog.
- [type-transformations.md](references/type-transformations.md) -- Type conflict resolution
- [format-specific-loading.md](references/format-specific-loading.md) -- CSV/JSON/Parquet/Avro/ORC specifics
- [athena-loading.md](references/athena-loading.md) -- Athena INSERT INTO as simple-load fallback
- [error-handling.md](references/error-handling.md) -- Ingest errors (connection errors delegate to connect-to-data-source)
- [error-handling.md](references/error-handling.md) -- Ingest errors (connection errors delegate to connecting-to-data-source)
- [upload-options.md](references/upload-options.md) -- aws s3 cp vs sync, multipart

### Migration-specific
@@ -1,6 +1,6 @@
# BigQuery Ingest

Move data from Google BigQuery into the data lake. Assumes a Glue `BIGQUERY` connection exists. If not, delegate to `connect-to-data-source`.
Move data from Google BigQuery into the data lake. Assumes a Glue `BIGQUERY` connection exists. If not, delegate to `connecting-to-data-source`.

## Contents

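
The renames in this diff follow one pattern (imperative skill names become gerunds), so leftover references to the old names could be caught mechanically. The sketch below is an assumption about how one might check for them, not something the PR includes. The `RENAMES` map is read off the changed lines in this diff, and `find_stale_names` is a hypothetical helper.

```python
import re

# Old skill name -> new skill name, as they appear in this diff.
RENAMES = {
    "connect-to-data-source": "connecting-to-data-source",
    "ingest-into-data-lake": "ingesting-into-data-lake",
    "create-data-lake-table": "creating-data-lake-table",
    "creating-data-lake-tables": "creating-data-lake-table",
    "query-data-lake": "querying-data-lake",
    "find-data-lake-assets": "finding-data-lake-assets",
    "store-and-query-vectors": "storing-and-querying-vectors",
}

# Match an old name only when it is not part of a longer hyphenated identifier.
_PATTERN = re.compile(
    r"(?<![\w-])("
    + "|".join(re.escape(name) for name in sorted(RENAMES, key=len, reverse=True))
    + r")(?![\w-])"
)


def find_stale_names(text):
    """Return (line_number, old_name, new_name) for each leftover old skill name."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((lineno, old, RENAMES[old]))
    return hits
```

Running `find_stale_names` over each skill file's text after the rename should return an empty list. Any hit points at a reference that still uses an old name.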