
Add Automated Testing with Microservices Demo Workload #48

Open · 4 tasks
yonch opened this issue Jan 31, 2025 · 26 comments
Labels
good first issue · help wanted

Comments

@yonch
Contributor

yonch commented Jan 31, 2025

Motivation: To validate different collection strategies, we need automated testing with realistic workloads. The Google Microservices Demo provides a good initial test environment with multiple languages and garbage collectors (Go, C#, Node.js, Python, Java). Building on our existing GitHub Actions infrastructure, we want to automatically deploy Kubernetes and run this demo workload.

The result will be a GitHub Action that spins up an AWS instance, deploys Kubernetes and the workload, and tears down the instance.

Tasks:

  • Install Kubernetes (k3s or KinD) on the CI instance
  • Deploy Google Microservices Demo (github.com/GoogleCloudPlatform/microservices-demo)
  • Configure and run the included load-generator for 1 minute with non-trivial load
  • Collect and store load-generator statistics

This infrastructure will allow testing different collection strategies and can be extended to other workloads in the future.
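A rough sketch of what the job body could look like once the runner instance is up; the manifest URL, the `sudo k3s kubectl` usage, and the artifact names are assumptions rather than a final design:

```yaml
# Sketch only: assumes a self-hosted EC2 runner is already registered by an earlier job.
jobs:
  demo-workload-test:
    runs-on: self-hosted
    steps:
      - name: Install k3s
        run: curl -sfL https://get.k3s.io | sh -

      - name: Deploy Google Microservices Demo
        run: |
          # Manifest path follows the microservices-demo repo layout; k3s bundles its own kubectl
          sudo k3s kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/microservices-demo/main/release/kubernetes-manifests.yaml
          sudo k3s kubectl wait --for=condition=Available --timeout=300s deployment --all -n default

      - name: Run the bundled load generator for 1 minute
        run: sleep 60   # the demo's loadgenerator deployment drives traffic continuously once running

      - name: Collect load-generator statistics
        run: sudo k3s kubectl logs deployment/loadgenerator --tail=-1 > loadgenerator-stats.txt

      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: loadgenerator-stats
          path: loadgenerator-stats.txt
```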

@yonch added the good first issue and help wanted labels Jan 31, 2025
@Chanaka1200

Hi @yonch

I will take on this task if no one else is working on it.

Thanks

@atimeofday
Contributor

Hi @Chanaka1200

There was some discussion about me learning a few things to take this one on, but I ended up busy and then sick - your contribution would be greatly appreciated.

@Chanaka1200

Hi @atimeofday

I completely understand how things can get busy, and I hope you're feeling better now. I'm happy to help and will do my best to contribute!

@Chanaka1200

Hi, I've almost completed this, but I have an issue. I am using a single GitHub Action for both creating and destroying the VM in AWS. To avoid this, I am trying to use AWS SSM Parameter Store, so I am requesting permission for it from @yonch.

@yonch
Contributor Author

yonch commented Mar 9, 2025

Great news! Can you say more about why we need to split into two actions? It's no problem adding permissions to SSM.

I'm asking to control complexity and potential leaked resources: if we keep everything in the same GitHub Action, it's easier to make cleanup more airtight. When we want to run multiple tests, each one is self-contained...

@Chanaka1200

Hi @yonch

I'm currently using a single action file to initiate a VM, set up k3s, and deploy the microservices, with the VM being destroyed at the end. I suggest splitting this into two separate actions: one for setup and deployment, and another for VM destruction. I found a way to pass the VM ID to the destroy action, but I welcome any other solutions!

@yonch
Contributor Author

yonch commented Mar 10, 2025

I think you have the right workflow:

  • launch VM
  • start k3s
  • deploy Microservices
  • run a test or multiple tests
  • upload results as an artifact
  • terminate the instance

I don't have a use-case for separating the VM launch from termination; keeping the VMs around (e.g., so we can manually change things) runs the risk of drift, where the system would not be at the state we expect and so experiment results would be invalid.

Do you have a strong use-case that requires separating the creation and termination? If not, let's leave that capability to future work.
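A sketch of how a single workflow can keep teardown airtight, assuming the machulav/ec2-github-runner action (which comes up later in this thread); the AMI, subnet, security group, and secret names are placeholders:

```yaml
# Sketch: one workflow owns the full lifecycle; stop-runner runs even if tests fail.
jobs:
  start-runner:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start.outputs.label }}
      ec2-instance-id: ${{ steps.start.outputs.ec2-instance-id }}
    steps:
      # AWS credentials must be configured before this step (e.g. aws-actions/configure-aws-credentials)
      - id: start
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.GH_RUNNER_PAT }}   # placeholder secret name
          ec2-image-id: ami-0123456789abcdef0          # placeholder
          ec2-instance-type: c5.9xlarge
          subnet-id: subnet-0123456789abcdef0          # placeholder
          security-group-id: sg-0123456789abcdef0      # placeholder

  run-tests:
    needs: start-runner
    runs-on: ${{ needs.start-runner.outputs.label }}
    steps:
      - run: echo "k3s install, demo deployment, load test, artifact upload go here"

  stop-runner:
    needs: [start-runner, run-tests]
    runs-on: ubuntu-latest
    if: always()   # terminate the instance even when tests fail
    steps:
      - uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.GH_RUNNER_PAT }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
```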

@Chanaka1200

Chanaka1200 commented Mar 12, 2025

Cool, my mistake. I initially thought we needed to keep this VM up and running to collect the metrics. If I understand correctly now, we run this single action once, invoke a microservice to collect the metrics, and then push the artifact within the same action. Apologies for the confusion—I’ll update this to align with the provided workflow.

@yonch
Contributor Author

yonch commented Mar 12, 2025

Great @Chanaka1200 sounds good! I'd love to start playing with this when you're done!

For a load generator, I am not sure if the Microservices demo bundles one; I've used Locust before, and a Grafana contributor told me over the weekend they use K6 (Grafana labs acquired the company that developed it, so not a huge surprise).
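For reference, Locust can be driven headlessly from the CLI with standard flags; the locustfile path and frontend address below are placeholders:

```bash
# Standard Locust CLI flags; the locustfile and target host are placeholders
locust -f locustfile.py --headless \
  --users 200 --spawn-rate 10 --run-time 1m \
  --host http://frontend.default.svc.cluster.local
```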

@Chanaka1200

Thank you for your patience. I’m currently working on it and apologize for the slight delay. Regarding the load generator, there isn’t one available for that microservice at the moment, but I’ll generate it—no issue. Both Locust and K6 are fine, and we can explore what we can do. I have one question: Should I add steps to collect metrics from test-kernel-module.yaml to the current YAML I’m working on?

@yonch
Contributor Author

yonch commented Mar 16, 2025

If it's not too much trouble, it would help to get the parquet file that is produced by test-ebpf-module -- that is what we need.

But once there is a Kubernetes cluster with a workload and load generator, I can add that too. So if you think you won't be doing that for a while, PR what you have and I'll continue it on Monday.

@Chanaka1200

Chanaka1200 commented Mar 16, 2025

No worries, I will complete this as soon as possible. It seems there is a load generator within the project, which I am currently reviewing. Once I finish, I will implement the workflow and submit the PR.

loadgenerator

@yonch
Contributor Author

yonch commented Mar 19, 2025

Hi @Chanaka1200 I really appreciate you taking on this ticket. I'd like to get a trial run going ahead of Kubecon EU to show some data in the talk. Not expecting any extra work -- it's volunteer work and on your schedule -- happy to pick up where you left off if you want to point me to it.

@Chanaka1200

Chanaka1200 commented Mar 19, 2025

Hi @yonch, I'm stuck on a small issue—I tried to pass inputs when running the action, but they are not being passed correctly. Once I resolve this, the action file will be almost complete. The reason is that the Locust load generator picks up the user and rate counts via environment variables:

```yaml
workflow_dispatch: # Manual trigger for testing
  inputs:
    users:
      description: 'Number of concurrent users for load generator'
      required: false
      default: '200'
      type: string
    rate:
      description: 'Requests per second'
      required: false
      default: '1'
      type: string

microservice-deployment:
  needs: [start-runner, init-ebpf, k3-deployment]
  runs-on: ${{ needs.start-runner.outputs.label }}
  steps:
    - name: Use input
      run: echo "Input was ${{ github.event.inputs.users }}"
```
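One possible way to forward these inputs to the bundled Locust deployment, assuming it reads the user and rate counts from its environment as described above (the USERS/RATE variable names are assumptions):

```yaml
- name: Configure load generator from workflow inputs
  run: |
    # USERS and RATE are assumed variable names on the loadgenerator deployment
    kubectl set env deployment/loadgenerator \
      USERS="${{ github.event.inputs.users }}" \
      RATE="${{ github.event.inputs.rate }}"
    kubectl rollout status deployment/loadgenerator --timeout=120s
```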

@Chanaka1200

Chanaka1200 commented Mar 19, 2025

It has been resolved for now using environment variables. I am still verifying, and I will send a PR once the eBPF test actions are implemented. I expect to complete this by the end of the day today.

@Chanaka1200

Chanaka1200 commented Mar 19, 2025

Hi @yonch

I'm encountering an issue again after deploying with eBPF and the microservices. The microservices are not running because there isn't enough disk space on the two VM types we're using: m7i.metal-24xl for RDT and c5.9xlarge. I believe the machulav/ec2-github-runner action is using the default disk size, and there seems to be no option to configure a larger disk.

Currently, this is part of a pull request that has not been merged yet:
PR #220.

Would you recommend using a VM with a larger default disk size, or do you have any suggestions to resolve this issue?

Sample event logs:

LAST SEEN   TYPE      REASON                           OBJECT                                        MESSAGE
2m26s       Warning   FailedScheduling                 pod/adservice-997b6fc95-5xxgl                 0/1 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/disk-pressure: }. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling.
2m58s       Normal    Scheduled                        pod/adservice-997b6fc95-d5wdt                 Successfully assigned default/adservice-997b6fc95-d5wdt to ip-172-31-11-249
2m46s       Normal    Pulling                          pod/adservice-997b6fc95-d5wdt                 Pulling image "us-central1-docker.pkg.dev/google-samples/microservices-demo/adservice:v0.10.2"
2m57s       Warning   Failed                           pod/adservice-997b6fc95-d5wdt                 Failed to pull image "us-central1-docker.pkg.dev/google-samples/microservices-demo/adservice:v0.10.2": pull QPS exceeded
2m57s       Warning   Failed                           pod/adservice-997b6fc95-d5wdt                 Error: ErrImagePull
2m57s       Normal    BackOff                          pod/adservice-997b6fc95-d5wdt                 Back-off pulling image "us-central1-docker.pkg.dev/google-samples/microservices-demo/adservice:v0.10.2"
2m57s       Warning   Failed                           pod/adservice-997b6fc95-d5wdt                 Error: ImagePullBackOff
2m43s       Normal    Pulled                           pod/adservice-997b6fc95-d5wdt                 Successfully pulled image "us-central1-docker.pkg.dev/google-samples/microservices-demo/adservice:v0.10.2" in 2.975s (2.975s including waiting). Image size: 100256966 bytes.
2m42s       Normal    Created                          pod/adservice-997b6fc95-d5wdt                 Created container: server
2m41s       Normal    Started                          pod/adservice-997b6fc95-d5wdt                 Started container server
2m27s       Warning   Evicted                          pod/adservice-997b6fc95-d5wdt                 The node was low on resource: ephemeral-storage. Threshold quantity: 406608697, available: 385580Ki. Container server was using 28Ki, request is 0, has larger consumption of ephemeral-storage.
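To confirm this is ephemeral-storage pressure rather than an image-pull or scheduling problem, a few generic checks can help; nothing below is project-specific:

```bash
# Check node conditions for DiskPressure, list eviction events, and inspect root volume usage
kubectl describe node | grep -A5 -i 'Conditions:'
kubectl get events --field-selector reason=Evicted -A
df -h /
```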

@yonch
Contributor Author

yonch commented Mar 19, 2025

@tverghis has used a fork of that GitHub action with features we needed, and it worked fine until the maintainer merged those features. We could do that here as well with that branch. You can see the `uses` statement here.

@Chanaka1200

Let me review this, and I will implement and test it as soon as possible.

@Chanaka1200

Chanaka1200 commented Mar 20, 2025

Hi @yonch

I tried @tverghis's fork, but I noticed that it does not include the disk size modifications. To address this, I used devin-purple's feature branch for the runner, and it is working fine.

Would it be okay to use this branch directly, or would you recommend forking it into my own repository before proceeding? Please let me know if any changes are needed.

If this approach is fine, I can send a PR, and we can review whether the eBPF implementation is correct.

Additionally, I have a question—while running the load generator, the eBPF collector runs for only a few seconds to gather metrics. Is this behavior acceptable, or should any adjustments be made?

@yonch
Contributor Author

yonch commented Mar 20, 2025

I think it should be fine to use their branch. Just please use a SHA to refer to the exact commit. I'm pretty sure Devin-purple will not maliciously add code to that branch, but it's the Internet! To be on the safe side let's refer to a known specific revision.
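For reference, pinning a third-party action to an exact commit looks like the following; the owner/repo and SHA are placeholders, not the actual fork or revision:

```yaml
# Placeholder owner/repo and SHA; pin to the exact commit that was reviewed
- uses: example-fork/ec2-github-runner@0123456789abcdef0123456789abcdef01234567
  with:
    mode: start
```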

@Chanaka1200

Sure! I will make the changes and send the PR today. Let me know if any further modifications are needed. I will inform you once the PR is sent.

@Chanaka1200

Microservices deployment with Test eBPF Metrics collector

I have submitted the pull request. Please review it and let me know if any changes are needed. I’m happy to contribute to this project!

@yonch
Contributor Author

yonch commented Mar 22, 2025

Awesome! Taking a look now

@Chanaka1200

Please let me know if there's anything I should do or change. I'm happy to help!

@yonch
Contributor Author

yonch commented Mar 23, 2025

  • The results look like there isn't much stress on the system; I'm wondering if the rate is too low. I also saw the load generator is limited to a low number of millicores (maybe 300?), which might limit it (a possible tweak is sketched after this list).

  • For waiting for the system to become ready, we can watch for those events. I haven't tried it, but something like this might work:
    ```
    kubectl wait --for=condition=Available --timeout=300s deployment --all -n default
    ```
    https://kubernetes.io/docs/reference/kubectl/generated/kubectl_wait/

  • I took a look at the opentelemetry demo repo; is it a more recently maintained version of the microservices demo?
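A possible tweak for the first point, assuming the loadgenerator deployment already defines a CPU limit at the usual path (the value is only illustrative):

```bash
# Raise the loadgenerator CPU limit so it can generate more load (illustrative value)
kubectl patch deployment loadgenerator --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "2"}
]'
```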

@Chanaka1200

I initially used the default resource values. Let me adjust them further to apply more load to the node. I used `kubectl wait --for=condition=Available --timeout=300s deployment --all -n default`, but sometimes a few pods take longer to become ready. I will try again. The OpenTelemetry demo repo and Google's microservices repo were both recently updated. I believe the issue was due to the default values. My apologies—I will correct this.
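If the deployment-level wait keeps timing out on a few stragglers, one option is to fall back to a pod-level readiness wait; this is an untested sketch:

```bash
# Wait on deployments first; if that times out, wait on individual pods becoming Ready
kubectl wait --for=condition=Available --timeout=300s deployment --all -n default || \
  kubectl wait --for=condition=Ready --timeout=300s pod --all -n default
```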
