Trifork Test Work Summary and Conclusions

Rune Skou Larsen edited this page Feb 3, 2015 · 1 revision

2015-02-05 Trifork Test Work Summary and Conclusions

Trifork has improved rtcloud in the following areas:

  • Provisioning of pre-made EBS volumes with Riak data.
  • Automated build and provisioning of riak_ee from github.
  • Automated build and provisioning of riak_test from github.
  • Ansible-driven riak_test runs on provisioned clusters.
  • Test report generation to capture config and output of riak_test runs.
  • Test report upload to S3.
  • AWS Command line utility.
  • Jenkins job setup.
  • Python code restructuring.

Each point is elaborated below.

Provisioning of pre-made EBS volumes with Riak data.

Riak often behaves very differently when tested with a significant amount of data. Because of the time and IO resources it takes to generate a usable dataset, it is not practical to have the test do this itself, e.g. by populating Riak with basho_bench. Instead, we want a pregenerated dataset that the test can start out with.

We built such a dataset on a copy-on-write (CoW) filesystem, because CoW allows easy rollback to a previous snapshot even after a test has modified data. BTRFS is readily available on Linux, so it was the weapon of choice. Creating the 30-million-key dataset went well on a 5-node Riak cluster. To use the snapshots in a test, add the following line to cluster.config under each relevant cluster:

```
ebs_snapshots = snap-0ec90bf0,snap-08c90bf6,snap-07c90bf9,snap-02c90bfc,snap-01c90bff
```

The snapshots contain the entire Riak data-directory including a leveldb backend, anti-entropy and ring.

Because the ring is included, the EBS volumes must be mounted in the right order, i.e. node 1 must have the first snapshot, and so on. When running dual-cluster tests, rtcloud automatically adjusts the ring files of cluster 2 to the correct node names (riak201.priv instead of riak101.priv, and so on).
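The node-name rewrite can be sketched as a same-length byte substitution — equal-length replacements keep the length-prefixed Erlang term format of the ring file well-formed. The function below is only an illustrative sketch of the idea, not rtcloud's actual implementation:

```python
# Sketch of the cluster-2 node-name adjustment described above.
# Assumption: node names follow the riakXNN.priv pattern from this page,
# and old/new names are the same length, so a byte-level replace does
# not disturb the surrounding binary term format.
def adjust_node_names(data: bytes, cluster_from: int = 1, cluster_to: int = 2) -> bytes:
    """Map riak1NN.priv node names to riak2NN.priv in a ring file blob."""
    for node in range(1, 6):  # 5-node cluster, as in the dataset above
        old = f"riak{cluster_from}{node:02d}.priv".encode()
        new = f"riak{cluster_to}{node:02d}.priv".encode()
        data = data.replace(old, new)
    return data

print(adjust_node_names(b"riak@riak101.priv").decode())  # riak@riak201.priv
```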

Data shape

Data is modelled on the Danish medical use case: an average value size of 400 bytes of incompressible data, with 2 secondary indices per key. Due to overhead, each object replica takes up about 600 bytes. anti_entropy adds about 25% to the space cost for this data shape.
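A back-of-the-envelope calculation from those figures gives the total on-disk footprint of the 30-million-key dataset; the replication factor n_val=3 is an assumption (Riak's default), not stated on this page:

```python
# Space estimate for the data shape described above.
keys = 30_000_000          # the 30-million-key dataset
bytes_per_replica = 600    # ~600 B per object replica, per the text
n_val = 3                  # assumed Riak default replication factor
aae_overhead = 0.25        # anti-entropy adds ~25% for this data shape

total_bytes = keys * bytes_per_replica * n_val * (1 + aae_overhead)
print(f"{total_bytes / 1e9:.1f} GB across the cluster")  # 67.5 GB across the cluster
```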

Building a 1B key dataset on EBS/Btrfs/Linux

Significant effort was put into generating a large dataset, which would be half the estimated Danish medical use case target of 2B keys. However, EBS problems kept emerging: the device would become unavailable to the kernel, the kernel would wait more than two minutes for IO, and ultimately the Riak instance would crash. Numerous attempts were made to fix the nodes where the EBS volume had crashed, including:

  • Destroying and recreating the EBS volume, restarting the cluster, and waiting for AAE to populate the empty node.
  • Destroying and recreating the EC2 instance.
  • Upgrading the Linux kernel.
  • Enabling EC2 enhanced networking.
  • Changing region between eu-west-1 and us-east-1.

These efforts were fruitless, as one or two of the cluster nodes continued to crash after 1-10 hours of data-initialization load. We did not try changing the filesystem to e.g. ext4, as that would have meant giving up CoW and redesigning the test. Another option would have been a different CoW filesystem on Linux.

stop-test-and-revert-data.yml

The Ansible playbook stop-test-and-revert-data.yml stops any beam process running on the testx01 machines and reverts the clusters to their startup state, so they are ready for another test from scratch. This is faster than provisioning new EC2 instances, and also cheaper in EC2 costs.
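Following the invocation pattern used for the other playbooks on this page, the revert would presumably be run like this (the inventory path is a placeholder, as elsewhere on this page):

```shell
# Stop running tests and roll the clusters back to their snapshot state.
ansible-playbook -i /path/to/rtcloud/clusters/my_cluster_name/generated/inventory \
    ../../playbooks/stop-test-and-revert-data.yml
```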

Automated build and provisioning of riak_ee from github.

When working on riak_ee and testing along the way, the two previous ways of getting a riak_ee release onto rtcloud were both somewhat involved:

  1. Create and wrap up the ee-release, upload it to S3, modify rtcloud to download and use it.
  2. With an "active" rtcloud, go into the riak_ee root dir, and run the rtcloud-riak-build command, which would build riak and provision the clusters with it.

To make it easier to automate tests using rtcloud, support was added for specifying riak_ee_branch in cluster.config. This makes rtcloud build and deploy riak_ee from the given branch as part of cluster provisioning.

Example cluster.config line: riak_ee_branch = 2.0
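For illustration, a minimal reader for such key=value lines could look like this; the format is taken from the cluster.config examples on this page, but rtcloud's real parser may well differ:

```python
# Sketch of parsing simple "key = value" lines as seen in cluster.config.
def read_config(text: str) -> dict:
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            config[key.strip()] = value.strip()
    return config

cfg = read_config("riak_ee_branch = 2.0\nriak_test_branch = enhance/krab/cluster-realtime-rebalance-cleanup")
print(cfg["riak_ee_branch"])  # 2.0
```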

Automated build and provisioning of riak_test from github.

Much like custom riak_ee releases, the tests themselves are also often works in progress when running rtcloud tests. To accommodate this, rtcloud now builds and deploys riak_test from the GitHub branch specified in cluster.config.

Example cluster.config line: riak_test_branch = enhance/krab/cluster-realtime-rebalance-cleanup

Ansible-driven riak_test runs on provisioned clusters.

To actually run a test against a provisioned cluster, choose the test you want to run and let Ansible do it:

```
ansible-playbook -i /path/to/rtcloud/clusters/my_cluster_name/generated/inventory ../../playbooks/run-test.yml -e "test=replication2_large_scale"
```

Test report generation to capture config and output of riak_test runs.

After a test completes, generating the test report is as simple as:

```
ansible-playbook -i /path/to/rtcloud/clusters/my_cluster_name/generated/inventory ../../playbooks/generate-test-report.yml
```

This extracts information from bench and the test machines and bundles it together with cluster.config into an archive named something like test-report-20150120174833.tar.gz.
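The timestamp pattern in the archive name appears to be YYYYMMDDHHMMSS; inferring that from the example name, it can be reproduced like so:

```python
# Produce a report name like test-report-20150120174833.tar.gz.
# The strftime pattern is inferred from the example name on this page.
from datetime import datetime

def report_name(ts: datetime) -> str:
    return f"test-report-{ts.strftime('%Y%m%d%H%M%S')}.tar.gz"

print(report_name(datetime(2015, 1, 20, 17, 48, 33)))  # test-report-20150120174833.tar.gz
```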

Test report upload to S3.

All test reports are automatically uploaded to Basho's S3 account, into the S3 bucket rtcloud-test-reports.

AWS Command line utility.

Out of necessity, and for lack of AWS web access, we built the ebs.py tool, which supports a range of AWS-related operations:

```
rsl@linux-54fw:~/projects/rtcloud> python bin/ebs.py

usage:
ebs.py create_volume sizeInGB [snapshot_id]
ebs.py create_volume_eu sizeInGB [snapshot_id]
ebs.py delete_volume volume_id
ebs.py delete_volume_eu volume_id
ebs.py attach_volume volume_id instance_id
ebs.py attach_volume_eu volume_id instance_id
ebs.py detach_volume volume_id
ebs.py detach_volume_eu volume_id
ebs.py create_snapshot volume_id
ebs.py create_snapshot_eu volume_id
ebs.py delete_snapshot snapshot_id
ebs.py delete_snapshot_eu snapshot_id
ebs.py copy_snapshot_to_eu snapshot_id
ebs.py copy_snapshot_to_us snapshot_id
ebs.py show_volumes region
ebs.py show_snapshots region
ebs.py reboot instance_id region
ebs.py terminate instance_id region
ebs.py placement_groups region
ebs.py delete_placement_group region name
ebs.py security_groups region
ebs.py instances region
```
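The subcommand style above maps naturally onto a small dispatch table. The sketch below illustrates that shape with a stub handler; the real ebs.py of course calls the AWS APIs rather than returning strings, and the handler shown here is hypothetical:

```python
# Illustrative subcommand dispatch in the style of ebs.py.
def create_volume(size_gb, snapshot_id=None):
    # Stub: the real tool would create an EBS volume via the AWS API.
    suffix = f" from {snapshot_id}" if snapshot_id else ""
    return f"creating {size_gb} GB volume{suffix}"

COMMANDS = {"create_volume": create_volume}

def main(argv):
    if len(argv) < 2 or argv[1] not in COMMANDS:
        print("usage: ebs.py <command> [args...]")
        return 1
    print(COMMANDS[argv[1]](*argv[2:]))
    return 0

main(["ebs.py", "create_volume", "100", "snap-0ec90bf0"])
# prints: creating 100 GB volume from snap-0ec90bf0
```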

Jenkins job setup

To facilitate CI internally at Trifork, we quickly set up a Jenkins job that:

  • Provisioned 2 clusters
  • Ran a replication test with background load
  • Generated and uploaded a test report
  • Shut down the provisioned machines

A test run took about an hour. Jenkins was set up to check out rtcloud from GitHub and then run this shell script:

```
cp -r preconfigured_clusters/largescaletest clusters
cd clusters/largescaletest
rtcloud-cluster-create
ansible-playbook -i /path/to/rtcloud/clusters/my_cluster_name/generated/inventory ../../playbooks/run-test.yml -e "test=replication2_large_scale" || true
ansible-playbook -i /path/to/rtcloud/clusters/my_cluster_name/generated/inventory ../../playbooks/generate-test-report.yml || true
sleep 1000 # Some time to debug before everything is wiped.
rtcloud-cluster-destroy
```

Python code restructuring

Commonly used functions were moved to bin/rtcloud/*.py to reduce/avoid code duplication.