In this assignment we simulate the ingestion of the 2018 Yellow Taxi Trip Data dataset into a Cassandra cluster (`mysimbdp-coredms`). The application serves several users at a time (we have tested it with up to 500 concurrent users) with no data loss, thanks to the RabbitMQ cluster employed between the clients and the database.
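To make the ingestion path concrete, here is a minimal sketch of what a client's publishing loop could look like; the `pika` library, the queue name `taxi_trips`, and the CSV filename are my assumptions, not necessarily what `client.py` actually uses.

```python
# Hypothetical sketch of a client's publishing loop; queue name, host,
# and CSV path are illustrative, not taken from the actual client.py.
import csv

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()
# A durable queue plus persistent messages is what lets the broker
# survive a node failure without losing rows.
channel.queue_declare(queue="taxi_trips", durable=True)

with open("2018_Yellow_Taxi_Trip_Data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        channel.basic_publish(
            exchange="",
            routing_key="taxi_trips",
            body=",".join(row),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent
        )
connection.close()
```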
Here is the platform design:
```
code
├── client.py
├── db
│   ├── docker-compose.yml
│   └── setuppers
│       ├── Dockerfile
│       └── setup_db.py
├── install_dependencies.sh
├── queue
│   ├── consumer.py
│   ├── docker-compose.yml
│   └── Dockerfile
├── README.md
├── run.sh
├── start_db.sh
├── start_queue.sh
└── utils
    ├── Dockerfile
    └── split_data.py
```
The deployment uses docker-compose to start two `cassandra-db` and two `bitnami/rabbitmq` containerised nodes; a rough sketch of the database compose file is shown below.
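For illustration only: service names, ports, and environment variables in this sketch are assumptions, and `code/queue/docker-compose.yml` would follow the same two-node pattern with `bitnami/rabbitmq` images.

```yaml
# Hypothetical sketch of code/db/docker-compose.yml, not the real file.
version: "3"
services:
  cassandra-1:
    image: cassandra
    ports:
      - "9042:9042"          # expose CQL to the host for the clients
  cassandra-2:
    image: cassandra
    environment:
      - CASSANDRA_SEEDS=cassandra-1   # join the ring through node 1
    depends_on:
      - cassandra-1
```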
Two other components run in containers as well:
- `setup_db` (`code/db/setuppers/Dockerfile`), the program that sets up the database keyspace;
- `consumer` (`code/queue/Dockerfile`), the program in charge of dequeuing the messages forwarded to the RabbitMQ cluster (its loop is sketched after this list).
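The essential property of such a consumer loop is that a message is acknowledged only after its row has been written to Cassandra, so a consumer crash cannot drop data. A sketch under assumed names (keyspace `taxi`, table `trips`, queue `taxi_trips`, illustrative hostnames), not the actual `consumer.py`:

```python
# Hypothetical sketch of the consumer loop; keyspace, table, column,
# and host names are illustrative, not taken from the real consumer.py.
import pika
from cassandra.cluster import Cluster

session = Cluster(["cassandra-1"]).connect("taxi")
insert = session.prepare(
    "INSERT INTO trips (trip_id, pickup_datetime, fare_amount) VALUES (?, ?, ?)"
)

def on_message(channel, method, properties, body):
    trip_id, pickup, fare = body.decode().split(",")[:3]
    session.execute(insert, (trip_id, pickup, float(fare)))
    # Ack only after the write succeeds, so a crash cannot lose the row.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq-1"))
channel = connection.channel()
channel.queue_declare(queue="taxi_trips", durable=True)
channel.basic_consume(queue="taxi_trips", on_message_callback=on_message)
channel.start_consuming()
```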
Both are pre-built on top of a Python 3 Docker image with the Cassandra Python driver as a dependency (`code/utils/Dockerfile`), which I pushed to my Docker repository so that it is downloaded automatically when composing.
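For reference, the shared base image can be as small as the sketch below; `pika` is my guess for the RabbitMQ client, since only the Cassandra driver is named above.

```dockerfile
# Sketch of what code/utils/Dockerfile could contain; the pika
# dependency is an assumption.
FROM python:3
RUN pip install --no-cache-dir cassandra-driver pika
```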
Everything is started by `run.sh`, which brings up the two docker-compose files (`code/db/docker-compose.yml` and `code/queue/docker-compose.yml`) and then launches the number of clients (`client.py`) specified by the user; an outline is sketched below.
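In outline, `run.sh` plausibly does something along these lines (paths, flags, and the client-count argument are illustrative, not copied from the real script):

```bash
#!/bin/bash
# Hypothetical outline of run.sh; the real script may differ in detail.
set -e

N_CLIENTS=${1:-1}   # number of concurrent clients requested by the user

# Bring up the Cassandra and RabbitMQ clusters.
docker-compose -f code/db/docker-compose.yml up -d
docker-compose -f code/queue/docker-compose.yml up -d

# Launch the requested number of clients in parallel.
for i in $(seq 1 "$N_CLIENTS"); do
    python3 code/client.py &
done
wait
```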
Logs can be found in `logs/`, each file following the format `<log_type>_<client_number>.log`, where `log_type` is one of `db`, `queue`, or `client`, and `client_number` is the number of concurrent clients run for the ingestion.