Test for Data Engineer

Q1. Access Log analytics

Running environment

Linux
Python 3.6+

Instruction

Download the access log and unzip it into the directory q1/data/ (URL: ftp://ita.ee.lbl.gov/traces/NASA_access_log_Aug95.gz)
Change directory to q1/
Run python count_http.py to count the total number of HTTP requests
Run python count_top10_hosts.py to find the top 10 hosts that made the most requests from 18th Aug to 20th Aug
Run python count_country.py to find the country with the most requests
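
The counting steps above can be sketched roughly as follows. This is a minimal illustration, not the code in count_top10_hosts.py: it assumes the log follows the Common Log Format, and the names LOG_PATTERN, parse_line and top_hosts are hypothetical.

```python
import re
from collections import Counter
from datetime import date, datetime

# Assumed layout (Common Log Format):
# host - - [01/Aug/1995:00:00:01 -0400] "GET /path HTTP/1.0" 200 1839
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(.*)" (\d{3}) (\S+)$')


def parse_line(line):
    """Return (host, timestamp) for a well-formed line, or None otherwise."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    host, raw_time = match.group(1), match.group(2)
    timestamp = datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z")
    return host, timestamp


def top_hosts(lines, first_day, last_day, limit=10):
    """Count requests per host between first_day and last_day (inclusive)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip malformed lines instead of failing
        host, timestamp = parsed
        if first_day <= timestamp.date() <= last_day:
            counts[host] += 1
    return counts.most_common(limit)
```

With the log opened as a text file (for example with errors="replace", in case it contains bytes that are not valid UTF-8), a call such as top_hosts(log_file, date(1995, 8, 18), date(1995, 8, 20)) would return the host and request-count pairs. The dates are compared in the time zone written in the log itself.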

Remark

To find the country with the most requests, it is assumed that the country of a host can be deduced from its domain name, which may not be true in reality.
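
One way to apply this assumption is to read the last label of the host name as a two-letter country code. The sketch below is only an illustration of that idea; country_code and most_common_country are hypothetical names, and how count_country.py treats generic domains (such as .com) or raw IP addresses is not stated here, so this version simply ignores them.

```python
from collections import Counter


def country_code(host):
    """Return the two-letter last label of a host name, or None if there is none."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return None
    last = labels[-1]
    # Two alphabetic characters are treated as a country code;
    # numeric labels (IP addresses) and longer labels (com, edu, net) are skipped.
    if len(last) == 2 and last.isalpha():
        return last
    return None


def most_common_country(hosts):
    """Return (country_code, request_count) for the most frequent code, or None."""
    counts = Counter()
    for host in hosts:
        code = country_code(host)
        if code is not None:
            counts[code] += 1
    if not counts:
        return None
    return counts.most_common(1)[0]
```

Here hosts is expected to contain one host name per request, so a host that sends many requests is counted many times.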

Q2. RDBMS

Running environment

MySQL
Docker

Instruction for environment setup

Install Docker and initialize swarm mode (docker swarm init)
Change directory to q2/ and issue the command docker stack deploy -c docker-compose.yml mysql
Connect to the local database on port 3308
Run the SQL statements in table_schema.sql and test_data.sql
Run the 2 statements in ans.sql, which are the answers to question 2

Q3. Simple (but hard) counter

Thought process on the design:

To solve this problem, the server that accepts the voting requests needs to be horizontally scalable.
Besides the backend server, the data storage needs to be scalable too; otherwise, it will become the bottleneck. A NoSQL database that supports sharding / partitioning should be used.
Initially, I planned to use Docker Swarm to solve this problem. Docker Swarm can help with scaling out, but doing so requires manually updating the deployment config, which takes monitoring and deployment effort. Docker Swarm therefore turned out to be probably not a good solution.
Given that the logic of the backend server is not complicated and does not need to be stateful, a serverless approach should be more suitable.

Design:

Frontend for voting: a static web page with a button that issues an AJAX call to cast a vote on click
- Information included in the AJAX call: the chosen candidate and the kiosk ID (or simply the IP address of the kiosk, if the IP address can identify the kiosk)
Backend server for processing voting requests: implemented with AWS Lambda (Python)
- When a voting request is received, the Python script inserts a record into DynamoDB. The record includes the candidate ID and the current timestamp.
Data storage: DynamoDB
- partition key: concatenated string of the kiosk ID and the timestamp
- sort key: timestamp
- some thoughts on choosing the partition key:
  - candidate ID would not scale well, for two reasons:
    - only 3 possible values
    - the values are not evenly distributed
  - kiosk ID is not suitable either, as kiosks in densely populated areas may receive many votes
  - Concatenating the kiosk ID with the timestamp scatters the records across partitions more evenly. It also avoids possible collisions on the primary key.
Frontend for viewing voting results: a static web page with an AJAX call to get the voting results
Backend server for serving voting results: implemented with AWS Lambda
- To retrieve the voting result of the last 10 minutes, a scan with a filter on the timestamp should be used.
- To retrieve the accumulated voting result, a scan is still needed. However, a full table scan for every such request would be unnecessarily expensive. The result and the latest timestamp should be cached, so that the next request scans only from the last cached timestamp and adds the new votes to the cached result.
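
The record layout and the cached, incremental tally described above could look roughly like the sketch below. It is a pure-Python model under stated assumptions, not the Lambda code: an in-memory list stands in for the DynamoDB table, no AWS SDK calls are made, and the attribute names (pk, ts, candidate_id) as well as build_vote_item, CachedTally and recent_counts are hypothetical.

```python
from collections import Counter


def build_vote_item(kiosk_id, candidate_id, timestamp_ms):
    """Build one vote record; the attribute names are illustrative only."""
    return {
        "pk": f"{kiosk_id}#{timestamp_ms}",  # partition key: kiosk ID + timestamp
        "ts": timestamp_ms,                  # sort key: timestamp
        "candidate_id": candidate_id,
    }


class CachedTally:
    """Keep the accumulated counts and read only records newer than the last one seen."""

    def __init__(self):
        self.counts = Counter()
        self.last_ts = 0

    def refresh(self, fetch_since):
        """fetch_since(ts) must return the records whose timestamp is greater than ts."""
        for item in fetch_since(self.last_ts):
            self.counts[item["candidate_id"]] += 1
            self.last_ts = max(self.last_ts, item["ts"])
        return dict(self.counts)


def recent_counts(items, now_ms, window_ms=10 * 60 * 1000):
    """Count votes whose timestamp falls within the last window_ms milliseconds."""
    cutoff = now_ms - window_ms
    counts = Counter(item["candidate_id"] for item in items if item["ts"] >= cutoff)
    return dict(counts)
```

One limitation of this sketch: a vote written later with a timestamp equal to or older than the cached one would be missed by refresh, so a real implementation would probably need some overlap or de-duplication when resuming from the cached timestamp.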
