Commit c6ad0cd ("initial commit")
1 parent f63ec98
20 files changed: +1666 -5 lines

.gitignore (+10)

@@ -0,0 +1,10 @@
*.swp
package-lock.json
__pycache__
.pytest_cache
.venv
*.egg-info

# CDK asset staging directory
.cdk.staging
cdk.out

README.md (+259 -5)

@@ -1,11 +1,265 @@
-## My Project
-
-TODO: Fill this README out!
-
-Be sure to:
-
-* Change the title in this README
-* Edit your repository description on GitHub

# Web Log Analytics with Amazon Kinesis Data Streams Proxy using Amazon API Gateway

This repository provides CDK scripts and sample code showing how to implement a simple [web analytics](https://en.wikipedia.org/wiki/Web_analytics) system.<br/>
The diagram below shows what we are implementing.

![web-analytics-arch](web-analytics-arch.svg)
The `cdk.json` file tells the CDK Toolkit how to execute your app.

This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
(or `python` for Windows) executable in your path with access to the `venv`
package. If for any reason the automatic creation of the virtualenv fails,
you can create the virtualenv manually.

To manually create a virtualenv on MacOS and Linux:

```
$ python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```
$ source .venv/bin/activate
```

If you are on a Windows platform, you would activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```

Once the virtualenv is activated, you can install the required dependencies.

```
(.venv) $ pip install -r requirements.txt
```

### Upload Lambda Layer code

Before deployment, you should upload the zipped code files to S3 like this:
<pre>
(.venv) $ aws s3api create-bucket --bucket <i>your-s3-bucket-name-for-lambda-layer-code</i> --region <i>region-name</i>
(.venv) $ ./build-aws-lambda-layer-package.sh <i>your-s3-bucket-name-for-lambda-layer-code</i>
</pre>
(:warning: Make sure you have **Docker** installed.)

For example,
<pre>
(.venv) $ aws s3api create-bucket --bucket lambda-layer-resources --region us-east-1
(.venv) $ ./build-aws-lambda-layer-package.sh lambda-layer-resources
</pre>

For more information about how to create a package for an AWS Lambda layer, see [here](https://aws.amazon.com/premiumsupport/knowledge-center/lambda-layer-simulated-docker/).

### Deploy

At this point you can now synthesize the CloudFormation template for this code.

<pre>
(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
(.venv) $ cdk synth --all \
              --parameters KinesisStreamName='your-kinesis-data-stream-name' \
              --parameters FirehoseStreamName='your-delivery-stream-name' \
              --parameters FirehosePrefix='your-s3-bucket-prefix' \
              --parameters LambdaLayerCodeS3BucketName=<i>'your-s3-bucket-name-for-lambda-layer-code'</i> \
              --parameters LambdaLayerCodeS3ObjectKey=<i>'your-s3-object-key-for-lambda-layer-code'</i>
</pre>

(:warning: The `CDK_DEFAULT_REGION` command above reads the region from the EC2 instance metadata endpoint, so it only works on an EC2 instance. On other machines, use something like `export CDK_DEFAULT_REGION=$(aws configure get region)` instead.)

Use the `cdk deploy` command to create the stack shown above.

<pre>
(.venv) $ cdk deploy --require-approval never --all \
              --parameters KinesisStreamName='your-kinesis-data-stream-name' \
              --parameters FirehoseStreamName='your-delivery-stream-name' \
              --parameters FirehosePrefix='your-s3-bucket-prefix' \
              --parameters LambdaLayerCodeS3BucketName=<i>'your-s3-bucket-name-for-lambda-layer-code'</i> \
              --parameters LambdaLayerCodeS3ObjectKey=<i>'your-s3-object-key-for-lambda-layer-code'</i>
</pre>

To add additional dependencies, for example other CDK libraries, just add
them to your `setup.py` file and rerun the `pip install -r requirements.txt`
command.
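For instance, a dependency on another CDK library would be declared in `setup.py` roughly like this (an illustrative sketch only; this project actually pins its dependencies in `requirements.txt`):

```python
# setup.py -- illustrative sketch, not part of this commit
import setuptools

setuptools.setup(
    name="web_analytics",
    version="0.0.1",
    packages=setuptools.find_packages(),
    install_requires=[
        "aws-cdk-lib==2.40.0",
        "constructs>=10.0.0,<11.0.0",
    ],
)
```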

## Run Test

1. Run the `GET /streams` method to invoke `ListStreams` in Kinesis.
   <pre>
   $ curl -X GET https://<i>your-api-gateway-id</i>.execute-api.us-east-1.amazonaws.com/v1/streams
   </pre>

   The response is:
   <pre>
   {
     "HasMoreStreams": false,
     "StreamNames": [
       "PUT-Firehose-aEhWz"
     ],
     "StreamSummaries": [
       {
         "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/PUT-Firehose-aEhWz",
         "StreamCreationTimestamp": 1661612556,
         "StreamModeDetails": {
           "StreamMode": "ON_DEMAND"
         },
         "StreamName": "PUT-Firehose-aEhWz",
         "StreamStatus": "ACTIVE"
       }
     ]
   }
   </pre>
2. Generate test data.
   <pre>
   (.venv) $ pip install -r requirements-dev.txt
   (.venv) $ python src/utils/gen_fake_data.py --max-count 5 --stream-name <i>PUT-Firehose-aEhWz</i> --api-url 'https://<i>your-api-gateway-id</i>.execute-api.us-east-1.amazonaws.com/v1' --api-method records
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260289903462649185194773668901646666226496176178","ShardId":"shardId-000000000003"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260289903462649185194774877827466280924390359090","ShardId":"shardId-000000000003"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260223001227053593325351479598467950537766600706","ShardId":"shardId-000000000000"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260245301972252123948494224242560213528447287314","ShardId":"shardId-000000000001"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260223001227053593325353897450107179933554966530","ShardId":"shardId-000000000000"}]}
   </pre>
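
   Under the hood, `src/utils/gen_fake_data.py` posts fake web-log events to the API Gateway Kinesis proxy. A minimal sketch of the idea, assuming the `records` method maps to Kinesis `PutRecords` as in the AWS Kinesis-proxy tutorial referenced below (the endpoint path and payload fields here are illustrative, not the script's exact implementation):

   ```python
   import json
   import uuid
   from datetime import datetime, timezone

   import requests  # pinned in requirements-dev.txt

   API_URL = "https://your-api-gateway-id.execute-api.us-east-1.amazonaws.com/v1"
   STREAM_NAME = "PUT-Firehose-aEhWz"

   # A fake web-log event matching the columns of the Athena tables created below.
   event = {
       "userId": str(uuid.uuid4()),
       "sessionId": uuid.uuid4().hex[:16],
       "referrer": "https://www.google.com",
       "userAgent": "Mozilla/5.0",
       "ip": "192.0.2.10",
       "hostname": "example.com",
       "os": "Linux",
       "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
       "uri": "/index.html",
   }

   # In the AWS tutorial, the API's mapping template base64-encodes each data
   # blob before calling PutRecords; adjust if your template expects otherwise.
   payload = {
       "records": [
           {"data": json.dumps(event), "partition-key": event["userId"]}
       ]
   }
   res = requests.put(f"{API_URL}/streams/{STREAM_NAME}/records", json=payload)
   print(res.status_code, res.text)
   ```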
3. Create and load a table with partitioned data in Amazon Athena.

   Go to [Athena](https://console.aws.amazon.com/athena/home) on the AWS Management Console.

   * (step 1) Create a database

     In order to create a new database called `mydatabase`, enter the following statement in the Athena query editor
     and click the **Run** button to execute the query.

     <pre>
     CREATE DATABASE mydatabase
     </pre>

   * (step 2) Create a table

     Copy the following query into the Athena query editor, replace the <i>xxxxx</i> in the last line under `LOCATION` with the string of your S3 bucket, and execute the query to create a new table.

     <pre>
     CREATE EXTERNAL TABLE `mydatabase.web_log_json`(
       `userId` string,
       `sessionId` string,
       `referrer` string,
       `userAgent` string,
       `ip` string,
       `hostname` string,
       `os` string,
       `timestamp` timestamp,
       `uri` string)
     PARTITIONED BY (
       `year` int,
       `month` int,
       `day` int,
       `hour` int)
     ROW FORMAT SERDE
       'org.openx.data.jsonserde.JsonSerDe'
     STORED AS INPUTFORMAT
       'org.apache.hadoop.mapred.TextInputFormat'
     OUTPUTFORMAT
       'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
     LOCATION
       's3://web-analytics-<i>xxxxx</i>/json-data'
     </pre>

     If the query is successful, a table named `web_log_json` is created and displayed on the left panel under the **Tables** section.

     If you get an error, check that (a) you have updated the `LOCATION` to the correct S3 bucket name, (b) you have `mydatabase` selected under the **Database** dropdown, and (c) you have `AwsDataCatalog` selected as the **Data source**.

   * (step 3) Load the partition data

     Run the following query to load the partition data.

     <pre>
     MSCK REPAIR TABLE mydatabase.web_log_json;
     </pre>

     After you run this command, the data is ready for querying. (`MSCK REPAIR TABLE` scans the table's `LOCATION` for Hive-style partition directories, e.g. `year=2022/month=8/...`, and registers them in the catalog.)
4. Run a test query.

   Enter the following SQL statement and execute the query.

   <pre>
   SELECT COUNT(*)
   FROM mydatabase.web_log_json;
   </pre>
5. Merge small files into large ones.

   When real-time incoming data is stored in S3 by Kinesis Data Firehose, many files with small data sizes are created.<br/>
   To improve the query performance of Amazon Athena, it is recommended to combine the small files into fewer large files.<br/>
   Also, it is better to use a columnar data format (e.g., `Parquet` or `ORC`) instead of `JSON` in Amazon Athena.<br/>
   Now we create an Athena table to query the large files that are created by the periodic merge-files task (a sketch of that task follows the DDL below).
   <pre>
   CREATE EXTERNAL TABLE `mydatabase.web_log_parquet`(
     `userId` string,
     `sessionId` string,
     `referrer` string,
     `userAgent` string,
     `ip` string,
     `hostname` string,
     `os` string,
     `timestamp` timestamp,
     `uri` string)
   PARTITIONED BY (
     `year` int,
     `month` int,
     `day` int,
     `hour` int)
   ROW FORMAT SERDE
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3://web-analytics-<i>xxxxx</i>/parquet-data'
   </pre>
   After creating the table, and once the merge-files task has completed, the data is ready for querying.
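
   The merge itself is performed by the Lambda function deployed in `MergeSmallFilesLambdaStack`. As a rough illustration of the idea only (the actual function's logic and query text live in this repository's source and may differ), an hourly merge can be expressed as an Athena `INSERT INTO ... SELECT` issued via boto3:

   ```python
   import boto3

   athena = boto3.client("athena")

   # Illustrative: rewrite one hour of JSON data into the Parquet table.
   # Athena writes the inserted rows as Parquet files under the target
   # table's LOCATION ('s3://web-analytics-xxxxx/parquet-data').
   query = """
   INSERT INTO mydatabase.web_log_parquet
   SELECT userId, sessionId, referrer, userAgent, ip, hostname, os,
          "timestamp", uri, year, month, day, hour
   FROM mydatabase.web_log_json
   WHERE year = 2022 AND month = 8 AND day = 27 AND hour = 13
   """

   athena.start_query_execution(
       QueryString=query,
       QueryExecutionContext={"Database": "mydatabase"},
       # Bucket for Athena query results; the name is illustrative.
       ResultConfiguration={"OutputLocation": "s3://your-athena-query-results/"},
   )
   ```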

## Clean Up

Delete the CloudFormation stacks by running the command below.

<pre>
(.venv) $ cdk destroy --force --all
</pre>

## Useful commands

 * `cdk ls`          list all stacks in the app
 * `cdk synth`       emits the synthesized CloudFormation template
 * `cdk deploy`      deploy this stack to your default AWS account/region
 * `cdk diff`        compare deployed stack with current state
 * `cdk docs`        open CDK documentation

Enjoy!

## References

 * [Web Analytics](https://en.wikipedia.org/wiki/Web_analytics)
 * [Tutorial: Create a REST API as an Amazon Kinesis proxy in API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-kinesis.html)
 * [Streaming Data Solution for Amazon Kinesis](https://aws.amazon.com/ko/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/)
   <div>
     <img src="https://d1.awsstatic.com/Solutions/Solutions%20Category%20Template%20Draft/Solution%20Architecture%20Diagrams/aws-streaming-data-using-api-gateway-architecture.1b9d28f061fe84385cb871ec58ccad18c7265d22.png" alt="aws-streaming-data-using-api-gateway-architecture" width="385" height="204">
   </div>
 * [Serverless Patterns Collection](https://serverlessland.com/patterns)
 * [aws-samples/serverless-patterns](https://github.com/aws-samples/serverless-patterns)
 * [Building fine-grained authorization using Amazon Cognito, API Gateway, and IAM](https://aws.amazon.com/ko/blogs/security/building-fine-grained-authorization-using-amazon-cognito-api-gateway-and-iam/)
 * [Amazon Athena Workshop](https://athena-in-action.workshop.aws/)
 * [Curl Cookbook](https://catonmat.net/cookbooks/curl)
 * [fastavro](https://fastavro.readthedocs.io/) - fast read/write of `AVRO` files
 * [Apache Avro Specification](https://avro.apache.org/docs/current/spec.html)
 * [How to create a Lambda layer using a simulated Lambda environment with Docker](https://aws.amazon.com/premiumsupport/knowledge-center/lambda-layer-simulated-docker/)
   ```
   $ cat <<EOF > requirements-Lambda-Layer.txt
   > fastavro==1.6.1
   > EOF
   $ docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.9" /bin/sh -c "pip install -r requirements-Lambda-Layer.txt -t python/lib/python3.9/site-packages/; exit"
   $ zip -r fastavro-lib.zip python > /dev/null
   $ aws s3 mb s3://my-bucket-for-lambda-layer-packages
   $ aws s3 cp fastavro-lib.zip s3://my-bucket-for-lambda-layer-packages/
   ```
## Security

app.py (+37)

@@ -0,0 +1,37 @@
#!/usr/bin/env python3
import os

import aws_cdk as cdk

from web_analytics import (
    KdsProxyApiGwStack,
    KdsStack,
    FirehoseDataTransformLambdaStack,
    FirehoseStack,
    MergeSmallFilesLambdaStack,
    VpcStack
)

AWS_ENV = cdk.Environment(account=os.getenv('CDK_DEFAULT_ACCOUNT'),
                          region=os.getenv('CDK_DEFAULT_REGION'))

app = cdk.App()

vpc_stack = VpcStack(app, 'WebAnalyticsVpc',
                     env=AWS_ENV)

# REST API that acts as a proxy in front of the Kinesis Data Stream
kds_proxy_apigw = KdsProxyApiGwStack(app, 'WebAnalyticsKdsProxyApiGw')
kds_stack = KdsStack(app, 'WebAnalyticsKinesisStream')

# Lambda function that Firehose invokes to validate/transform each record
firehose_data_transform_lambda = FirehoseDataTransformLambdaStack(app,
    'WebAnalyticsFirehoseDataTransformLambda')

# Delivery stream: Kinesis Data Stream -> transform Lambda -> S3
firehose_stack = FirehoseStack(app, 'WebAnalyticsFirehose',
    kds_stack.target_kinesis_stream.stream_arn,
    firehose_data_transform_lambda.schema_validator_lambda_fn)

# Scheduled task that merges the small files written by Firehose into larger ones
merge_small_files_stack = MergeSmallFilesLambdaStack(app,
    'WebAnalyticsMergeSmallFiles',
    firehose_stack.s3_dest_bucket_name
)

app.synth()
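
The `schema_validator_lambda_fn` wired into the Firehose stack follows the standard Kinesis Data Firehose data-transformation contract: the function receives base64-encoded records and must return each `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal sketch of such a handler (illustrative only; the repository's actual function presumably validates records against an Avro schema using the `fastavro` layer built below):

```python
import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event['records']:
        payload = base64.b64decode(record['data'])
        try:
            # Stand-in validation; the real function would check the
            # decoded record against an Avro schema with fastavro.
            json.loads(payload)
            result = 'Ok'
        except json.JSONDecodeError:
            result = 'ProcessingFailed'
        output.append({
            'recordId': record['recordId'],
            'result': result,
            'data': base64.b64encode(payload).decode('utf-8'),
        })
    return {'records': output}
```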

build-aws-lambda-layer-package.sh (+11)

@@ -0,0 +1,11 @@
#!/bin/bash -

LAMBDA_LAYER_NAME=fastavro-lib
S3_PATH=$1

if [ -z "${S3_PATH}" ]; then
  echo "Usage: $0 your-s3-bucket-name-for-lambda-layer-code" >&2
  exit 1
fi

# Build the layer inside the AWS SAM build image so that compiled wheels
# match the Lambda Python 3.9 runtime.
docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.9" /bin/sh -c "pip install fastavro==1.6.1 -t python/lib/python3.9/site-packages/; exit"

zip -q -r ${LAMBDA_LAYER_NAME}.zip python >/dev/null
aws s3 cp --quiet ${LAMBDA_LAYER_NAME}.zip s3://${S3_PATH}/${LAMBDA_LAYER_NAME}.zip
echo "[Lambda_Layer_Code_S3_Path] s3://${S3_PATH}/${LAMBDA_LAYER_NAME}.zip"

cdk.json (+40)

@@ -0,0 +1,40 @@
{
  "app": "python3 app.py",
  "watch": {
    "include": [
      "**"
    ],
    "exclude": [
      "README.md",
      "cdk*.json",
      "requirements*.txt",
      "source.bat",
      "**/__init__.py",
      "python/__pycache__",
      "tests"
    ]
  },
  "context": {
    "@aws-cdk/aws-apigateway:usagePlanKeyOrderInsensitiveId": true,
    "@aws-cdk/core:stackRelativeExports": true,
    "@aws-cdk/aws-rds:lowercaseDbIdentifier": true,
    "@aws-cdk/aws-lambda:recognizeVersionProps": true,
    "@aws-cdk/aws-lambda:recognizeLayerVersion": true,
    "@aws-cdk/aws-cloudfront:defaultSecurityPolicyTLSv1.2_2021": true,
    "@aws-cdk-containers/ecs-service-extensions:enableDefaultLogDriver": true,
    "@aws-cdk/aws-ec2:uniqueImdsv2TemplateName": true,
    "@aws-cdk/core:checkSecretUsage": true,
    "@aws-cdk/aws-iam:minimizePolicies": true,
    "@aws-cdk/aws-ecs:arnFormatIncludesClusterName": true,
    "@aws-cdk/core:validateSnapshotRemovalPolicy": true,
    "@aws-cdk/aws-codepipeline:crossAccountKeyAliasStackSafeResourceName": true,
    "@aws-cdk/aws-s3:createDefaultLoggingPolicy": true,
    "@aws-cdk/aws-sns-subscriptions:restrictSqsDescryption": true,
    "@aws-cdk/aws-apigateway:disableCloudWatchRole": true,
    "@aws-cdk/core:enablePartitionLiterals": true,
    "@aws-cdk/core:target-partitions": [
      "aws",
      "aws-cn"
    ]
  }
}

requirements-dev.txt (+6)

@@ -0,0 +1,6 @@
boto3>=1.24.41
mimesis==6.0.0
requests==2.28.1

# packages for Lambda Layer
fastavro==1.6.1

requirements.txt (+7)

@@ -0,0 +1,7 @@
aws-cdk-lib==2.40.0
constructs>=10.0.0,<11.0.0

# https://github.com/aws/aws-cdk/issues/21902
# issue: Type 'aws-cdk-lib.aws_apigateway.EmptyModel' not found
# solution: downgrading the jsii version to 1.66.0 works around it
jsii==1.66.0
