# Web Log Analytics with Amazon Kinesis Data Streams Proxy using Amazon API Gateway

This repository provides CDK scripts and sample code showing how to implement a simple [web analytics](https://en.wikipedia.org/wiki/Web_analytics) system.<br/>
The diagram below shows what we are implementing.

The `cdk.json` file tells the CDK Toolkit how to execute your app.

This project is set up like a standard Python project. The initialization
process also creates a virtualenv within this project, stored under the `.venv`
directory. To create the virtualenv it assumes that there is a `python3`
(or `python` for Windows) executable in your path with access to the `venv`
package. If for any reason the automatic creation of the virtualenv fails,
you can create the virtualenv manually.

To manually create a virtualenv on MacOS and Linux:

```
$ python3 -m venv .venv
```

After the init process completes and the virtualenv is created, you can use the following
step to activate your virtualenv.

```
$ source .venv/bin/activate
```

If you are on a Windows platform, you would activate the virtualenv like this:

```
% .venv\Scripts\activate.bat
```

Once the virtualenv is activated, you can install the required dependencies.

```
(.venv) $ pip install -r requirements.txt
```

### Upload Lambda Layer code

Before deployment, you should upload the zipped code files to S3 like this:
<pre>
(.venv) $ aws s3api create-bucket --bucket <i>your-s3-bucket-name-for-lambda-layer-code</i> --region <i>region-name</i>
(.venv) $ ./build-aws-lambda-layer-package.sh <i>your-s3-bucket-name-for-lambda-layer-code</i>
</pre>
(:warning: Make sure you have **Docker** installed.)

For example,
<pre>
(.venv) $ aws s3api create-bucket --bucket lambda-layer-resources --region us-east-1
(.venv) $ ./build-aws-lambda-layer-package.sh lambda-layer-resources
</pre>

For more information about how to create a package for an AWS Lambda layer, see [here](https://aws.amazon.com/premiumsupport/knowledge-center/lambda-layer-simulated-docker/).

### Deploy

At this point you can now synthesize the CloudFormation template for this code.

<pre>
(.venv) $ export CDK_DEFAULT_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
(.venv) $ export CDK_DEFAULT_REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)
</pre>
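# The command above reads the region from EC2 instance metadata, so it only works on an EC2 instance.
# If you are running elsewhere, set the region manually, e.g.: export CDK_DEFAULT_REGION=us-east-1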
(.venv) $ cdk synth --all \
              --parameters KinesisStreamName='your-kinesis-data-stream-name' \
              --parameters FirehoseStreamName='your-delivery-stream-name' \
              --parameters FirehosePrefix='your-s3-bucket-prefix' \
              --parameters LambdaLayerCodeS3BucketName=<i>'your-s3-bucket-name-for-lambda-layer-code'</i> \
              --parameters LambdaLayerCodeS3ObjectKey=<i>'your-s3-object-key-for-lambda-layer-code'</i>
</pre>

Use the `cdk deploy` command to create the stack shown above.

<pre>
(.venv) $ cdk deploy --require-approval never --all \
              --parameters KinesisStreamName='your-kinesis-data-stream-name' \
              --parameters FirehoseStreamName='your-delivery-stream-name' \
              --parameters FirehosePrefix='your-s3-bucket-prefix' \
              --parameters LambdaLayerCodeS3BucketName=<i>'your-s3-bucket-name-for-lambda-layer-code'</i> \
              --parameters LambdaLayerCodeS3ObjectKey=<i>'your-s3-object-key-for-lambda-layer-code'</i>
</pre>

To add additional dependencies, for example other CDK libraries, just add
them to your `setup.py` file and rerun the `pip install -r requirements.txt`
command.

## Run Test

1. Run the `GET /streams` method to invoke `ListStreams` in Kinesis.
   <pre>
   $ curl -X GET https://<i>your-api-gateway-id</i>.execute-api.us-east-1.amazonaws.com/v1/streams
   </pre>

   The response is:
   <pre>
   {
     "HasMoreStreams": false,
     "StreamNames": [
       "PUT-Firehose-aEhWz"
     ],
     "StreamSummaries": [
       {
         "StreamARN": "arn:aws:kinesis:us-east-1:123456789012:stream/PUT-Firehose-aEhWz",
         "StreamCreationTimestamp": 1661612556,
         "StreamModeDetails": {
           "StreamMode": "ON_DEMAND"
         },
         "StreamName": "PUT-Firehose-aEhWz",
         "StreamStatus": "ACTIVE"
       }
     ]
   }
   </pre>
2. Generate test data.
   <pre>
   (.venv) $ pip install -r requirements-dev.txt
   (.venv) $ python src/utils/gen_fake_data.py --max-count 5 --stream-name <i>PUT-Firehose-aEhWz</i> --api-url 'https://<i>your-api-gateway-id</i>.execute-api.us-east-1.amazonaws.com/v1' --api-method records
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260289903462649185194773668901646666226496176178","ShardId":"shardId-000000000003"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260289903462649185194774877827466280924390359090","ShardId":"shardId-000000000003"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260223001227053593325351479598467950537766600706","ShardId":"shardId-000000000000"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260245301972252123948494224242560213528447287314","ShardId":"shardId-000000000001"}]}
   [200 OK] {"EncryptionType":"KMS","FailedRecordCount":0,"Records":[{"SequenceNumber":"49633315260223001227053593325353897450107179933554966530","ShardId":"shardId-000000000000"}]}
   </pre>
3. Create and load a table with partitioned data in Amazon Athena.

   Go to [Athena](https://console.aws.amazon.com/athena/home) on the AWS Management Console.<br/>
   * (step 1) Create a database

     In order to create a new database called `mydatabase`, enter the following statement in the Athena query editor
     and click the **Run** button to execute the query.

     <pre>
     CREATE DATABASE mydatabase
     </pre>

   * (step 2) Create a table

     Copy the following query into the Athena query editor, replace the `xxxxx` in the last line under `LOCATION` with your S3 bucket name, and execute the query to create a new table.
     <pre>
     CREATE EXTERNAL TABLE `mydatabase.web_log_json`(
       `userId` string,
       `sessionId` string,
       `referrer` string,
       `userAgent` string,
       `ip` string,
       `hostname` string,
       `os` string,
       `timestamp` timestamp,
       `uri` string)
     PARTITIONED BY (
       `year` int,
       `month` int,
       `day` int,
       `hour` int)
     ROW FORMAT SERDE
       'org.openx.data.jsonserde.JsonSerDe'
     STORED AS INPUTFORMAT
       'org.apache.hadoop.mapred.TextInputFormat'
     OUTPUTFORMAT
       'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
     LOCATION
       's3://web-analytics-<i>xxxxx</i>/json-data'
     </pre>
     If the query is successful, a table named `web_log_json` is created and displayed on the left panel under the **Tables** section.

     If you get an error, check if (a) you have updated the `LOCATION` to the correct S3 bucket name, (b) you have `mydatabase` selected under the **Database** dropdown, and (c) you have `AwsDataCatalog` selected as the **Data source**.

   * (step 3) Load the partition data

     Run the following query to load the partition data.
     <pre>
     MSCK REPAIR TABLE mydatabase.web_log_json;
     </pre>
     After you run this command, the data is ready for querying.

4. Run a test query.

   Enter the following SQL statement and execute the query.
   <pre>
   SELECT COUNT(*)
   FROM mydatabase.web_log_json;
   </pre>
5. Merge small files into larger ones.

   When real-time incoming data is stored in S3 using Kinesis Data Firehose, many files with a small data size are created.<br/>
   To improve the query performance of Amazon Athena, it is recommended to combine small files into fewer large files.<br/>
   It is also better to use a columnar data format (e.g., `Parquet`, `ORC`, or `Avro`) instead of `JSON` in Amazon Athena.<br/>
   We now create an Athena table to query the larger files produced by the periodic file-merge task (an illustrative compaction query is shown after this list).
   <pre>
   CREATE EXTERNAL TABLE `mydatabase.web_log_parquet`(
     `userId` string,
     `sessionId` string,
     `referrer` string,
     `userAgent` string,
     `ip` string,
     `hostname` string,
     `os` string,
     `timestamp` timestamp,
     `uri` string)
   PARTITIONED BY (
     `year` int,
     `month` int,
     `day` int,
     `hour` int)
   ROW FORMAT SERDE
     'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
   OUTPUTFORMAT
     'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
   LOCATION
     's3://web-analytics-<i>xxxxx</i>/parquet-data'
   </pre>
   After the table is created and the merge task has completed, the data is ready for querying.

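The periodic merge task itself is not reproduced in this README. Purely as an illustration of the idea (and not necessarily how this project's merge task is implemented), small JSON records could be compacted into the partitioned Parquet table with an Athena query along the following lines, where the partition values are hypothetical:
<pre>
-- Illustrative compaction query: rewrites one hour of JSON data as Parquet
-- by inserting it into the Parquet-backed table created in step 5.
INSERT INTO mydatabase.web_log_parquet
SELECT userId, sessionId, referrer, userAgent, ip, hostname, os, "timestamp", uri,
       year, month, day, hour
FROM mydatabase.web_log_json
WHERE year = 2022 AND month = 8 AND day = 27 AND hour = 10;
</pre>
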
## Clean Up

Delete the CloudFormation stack by running the command below.
<pre>
(.venv) $ cdk destroy --force --all
</pre>


## Useful commands

 * `cdk ls`          list all stacks in the app
 * `cdk synth`       emits the synthesized CloudFormation template
 * `cdk deploy`      deploy this stack to your default AWS account/region
 * `cdk diff`        compare deployed stack with current state
 * `cdk docs`        open CDK documentation

Enjoy!

## References

 * [Web Analytics](https://en.wikipedia.org/wiki/Web_analytics)
 * [Tutorial: Create a REST API as an Amazon Kinesis proxy in API Gateway](https://docs.aws.amazon.com/apigateway/latest/developerguide/integrating-api-with-aws-services-kinesis.html)
 * [Streaming Data Solution for Amazon Kinesis](https://aws.amazon.com/ko/solutions/implementations/aws-streaming-data-solution-for-amazon-kinesis/)
   <div>
     <img src="https://d1.awsstatic.com/Solutions/Solutions%20Category%20Template%20Draft/Solution%20Architecture%20Diagrams/aws-streaming-data-using-api-gateway-architecture.1b9d28f061fe84385cb871ec58ccad18c7265d22.png" alt="aws-streaming-data-using-api-gateway-architecture" width="385" height="204">
   </div>
 * [Serverless Patterns Collection](https://serverlessland.com/patterns)
 * [aws-samples/serverless-patterns](https://github.com/aws-samples/serverless-patterns)
 * [Building fine-grained authorization using Amazon Cognito, API Gateway, and IAM](https://aws.amazon.com/ko/blogs/security/building-fine-grained-authorization-using-amazon-cognito-api-gateway-and-iam/)
 * [Amazon Athena Workshop](https://athena-in-action.workshop.aws/)
 * [Curl Cookbook](https://catonmat.net/cookbooks/curl)
 * [fastavro](https://fastavro.readthedocs.io/) - Fast read/write of `AVRO` files
 * [Apache Avro Specification](https://avro.apache.org/docs/current/spec.html)
 * [How to create a Lambda layer using a simulated Lambda environment with Docker](https://aws.amazon.com/premiumsupport/knowledge-center/lambda-layer-simulated-docker/)
   ```
   $ cat <<EOF > requirements-Lambda-Layer.txt
   > fastavro==1.6.1
   > EOF
   $ docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.9" /bin/sh -c "pip install -r requirements-Lambda-Layer.txt -t python/lib/python3.9/site-packages/; exit"
   $ zip -r fastavro-lib.zip python > /dev/null
   $ aws s3 mb s3://my-bucket-for-lambda-layer-packages
   $ aws s3 cp fastavro-lib.zip s3://my-bucket-for-lambda-layer-packages/
   ```

## Security