GHES instance with internal certificate throwing "unknown authority" error #381
Comments
Hi @Hdom! Try: garm-cli github endpoint update --ca-cert-path /your/ghes/ca.pem ghes.example.com See: https://github.com/cloudbase/garm/blob/main/doc/using_garm.md#creating-a-github-endpoint Edit: The CA you specify here will be used to create the github client and will be passed along to the runners as an additional CA, so that your runners can also communicate with GHES.
Hello @gabriel-samfira, I had made sure to add the bundle before, but I'm still getting the same error. Here are the details:
Out of curiosity, could you try using a PAT instead of an app just to see if the error goes away? This is where we compose the transport for the github client: Lines 635 to 687 in 0d53dce
Apps and PATs have different code paths. For apps we leverage another package, and I want to verify whether it's because I don't properly pass the CA certs to that package or whether the issue is in the way we create the transport and pass it along to the github client. Sadly, I do not have a GHES to test against, so I may need to bug you to verify potential fixes.
I think I may know where the issue is. Will propose a PR soon that you can test out.
But do try out PATs in the meantime.
@gabriel-samfira I apologize for the delay. I switched to a PAT and it was able to update tools.
Okay. Then I need to focus on the ghinstallation bit and determine how to properly pass in a transport that includes the root CA pool.
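For context, a minimal sketch of what wiring a root CA pool into the ghinstallation transport might look like. This is not GARM's actual code; the file paths, app/installation IDs, hostname, and go-github module version are placeholders:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"os"

	"github.com/bradleyfalzon/ghinstallation/v2"
	"github.com/google/go-github/v57/github"
)

func main() {
	// Build a cert pool that includes the internal GHES CA.
	caPEM, err := os.ReadFile("/your/ghes/ca.pem") // placeholder path
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		panic("failed to parse CA bundle")
	}

	// The custom transport has to be the one handed to ghinstallation,
	// otherwise the app token requests fall back to the default roots.
	baseTransport := &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: pool},
	}

	key, err := os.ReadFile("/path/to/app-private-key.pem") // placeholder path
	if err != nil {
		panic(err)
	}
	itr, err := ghinstallation.New(baseTransport, 12345, 67890, key) // placeholder IDs
	if err != nil {
		panic(err)
	}
	// Point installation token requests at the GHES API instead of api.github.com.
	itr.BaseURL = "https://ghes.example.com/api/v3"

	client, err := github.NewClient(&http.Client{Transport: itr}).
		WithEnterpriseURLs("https://ghes.example.com", "https://ghes.example.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(client.BaseURL)
}
```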
I am available to verify potential fixes, just let me know.
Could you try out:
It stopped throwing the unknown authority error, which I see as progress, but I'm getting a new error which I think is related to redirection:
The access_tokens URL seems to be missing the api part:
I believe APIBaseURL is a value configured in the github endpoint; mine currently looks like this:
I'm going to try updating that value to include the api/v3 path.
The go-github client should do that automatically.
That may be so, but it did fix the issue; tools are updating successfully now. I'm not sure what other unintended consequences it might bring.
There shouldn't be any. The URLs are only used in the github client. The BaseURL should be left as
Ahh. I see why it didn't automatically add the URI: it also looks for
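For reference, a small sketch of the go-github behavior being discussed (module version and hostname are placeholders): WithEnterpriseURLs appends the /api/v3/ and /api/uploads/ suffixes itself when the configured URLs don't already include them, and, if memory serves, it skips that rewrite when the host looks like an API host (e.g. starts with api.), which may be the check alluded to above:

```go
package main

import (
	"fmt"

	"github.com/google/go-github/v57/github"
)

func main() {
	// WithEnterpriseURLs normalizes the GHES endpoints: a trailing slash is
	// ensured and /api/v3/ (and /api/uploads/ for uploads) is appended when
	// the URL doesn't already carry it.
	client, err := github.NewClient(nil).WithEnterpriseURLs(
		"https://ghes.example.com", // placeholder GHES base URL
		"https://ghes.example.com",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(client.BaseURL)   // https://ghes.example.com/api/v3/
	fmt.Println(client.UploadURL) // https://ghes.example.com/api/uploads/
}
```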
Yeah, I need to update the docs to explicitly say we need to add those paths. I merged: That should get you up and running.
Apropos, if you're feeling brave or just bored, I am working on adding scale sets to GARM: If you have a recent version of GHES and would like to test them out, I would appreciate it! The feature is mostly done. Some rough edges may exist, but for the most part it should work. Just make sure to use unique names for scale sets that are not the same as any labels your pools may have.
How are scale sets different from pools? Also, I am encountering a 500 error on /api/v1/metadata/runner-registration-token/ which is not outputting any logs on the garm side or on the runner side. I believe the obfuscation is coming from the fact that the API returns the error in the response content. I added a slog right before handleError to get a better picture of what was going on. Edit: Turns out I was missing Repo Admin (write).
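For illustration only, roughly the kind of logging described above; handleError and the handler below are stand-ins, not GARM's actual code:

```go
package main

import (
	"errors"
	"log/slog"
	"net/http"
)

// handleError stands in for the existing error helper, which writes the error
// into the response body (where it is easy to miss).
func handleError(w http.ResponseWriter, err error) {
	http.Error(w, err.Error(), http.StatusInternalServerError)
}

func registrationTokenHandler(getToken func() (string, error)) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		token, err := getToken()
		if err != nil {
			// Surface the error in the server logs as well, so a 500 on the
			// metadata endpoint is visible without inspecting the response.
			slog.ErrorContext(r.Context(), "fetching runner registration token failed", "error", err)
			handleError(w, err)
			return
		}
		_, _ = w.Write([]byte(token))
	}
}

func main() {
	http.Handle("/api/v1/metadata/runner-registration-token/",
		registrationTokenHandler(func() (string, error) {
			return "", errors.New("placeholder: token lookup not wired up")
		}))
	_ = http.ListenAndServe("127.0.0.1:9997", nil)
}
```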
Scale sets are essentially pools, but GHES server side. They allow the auto scaler to long-poll for jobs, specify max jobs, and do not require webhooks to work. Once the auto scaler gets a jobs message, it can take responsibility for them and spin up workers. This allows for simpler management of runners and in theory is more reliable in terms of letting the auto scaler know that jobs are available. Webhooks (from what GitHub says) are not reliable and may not be fired. You also never know if you have a job waiting if your autoscaler was down when the webhook was generated.
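To make the long-poll model concrete, here is a generic HTTP long-poll consumer loop. The scale set APIs are not documented (as noted further down), so the endpoint URL, message shape, and timeouts below are invented purely for illustration and are not the actual GitHub protocol:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// jobMessage is an invented shape standing in for whatever the real queue
// delivers (job available, job completed, and so on).
type jobMessage struct {
	Type  string `json:"type"`
	JobID int64  `json:"job_id"`
}

// pollOnce issues a single long-poll request: the server holds the request
// open until a message is available or its hold time elapses.
func pollOnce(ctx context.Context, client *http.Client, url string) ([]jobMessage, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusNoContent {
		return nil, nil // nothing arrived before the hold time expired; poll again
	}
	var msgs []jobMessage
	if err := json.NewDecoder(resp.Body).Decode(&msgs); err != nil {
		return nil, err
	}
	return msgs, nil
}

func main() {
	client := &http.Client{Timeout: 60 * time.Second}     // longer than the server hold time
	queueURL := "https://example.com/fake-message-queue" // placeholder, not a real endpoint

	for {
		msgs, err := pollOnce(context.Background(), client, queueURL)
		if err != nil {
			time.Sleep(5 * time.Second) // back off and retry
			continue
		}
		for _, m := range msgs {
			// A real autoscaler would acknowledge the message and spin up
			// (or tear down) a runner here.
			fmt.Printf("got %s for job %d\n", m.Type, m.JobID)
		}
	}
}
```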
I really need to switch everything from the
Yeah, it forces us to have idle runners available to handle that possibility, as the idle scaling system ensures missed webhooks are taken care of. Have you had the opportunity to do any sort of performance testing at a large scale? I did some tests a while back and experienced a strange bottleneck, both in the github webhook reception and in the garm processing, that limits the "rpm" to around 200 runners per minute. My findings showed that splitting the performance test across multiple repos reduced the github-side bottleneck but increased the garm bottleneck. When using 1 repo and 1600 runners:
When using 8 repos and 1600 runners:
I have not stressed it that hard, but I suspect that you'd hit API rate limits pretty quickly with 8x1600 runners. I am guessing you also raised max runners. Keep in mind that each provider op means executing the provider binary. You might want to check load average on your garm server. I would love to get some metrics from you if you ever run another stress test. Perhaps some logs as well. If you're really bored, try to stress test the scaleset branch. I am curious how that fares.
Ahh. It was 200x8. My bad. Interesting. What is the value of "Minimum Job Age Backoff" in your controller settings? garm-cli controller show
30
I am hoping to get another chance eventually, but I will need to do it in an isolated environment this time, since it was very hard to see what was happening with all of the non-test-specific noise. On the subject of API limits, when it comes to requesting, I used 16 load workflows, each spinning up 100 "test" workflows; since github generates a separate token for each workflow, they don't interfere with each other's primary rate limits and would only interact on secondary rate limits.
Yes, for the first test I did all 1600 runners in a single repo; the second test was 200 each in 8 repos.
Set the backoff to 0 and see if that improves things: garm-cli controller update --minimum-job-age-backoff=0 That backoff waits for 30 seconds from the time a job arrives to the time when the pool reacts to the job. The reasoning was to allow any idle runners to pick the job up, reducing runner churn (not spinning up a new runner only to tear it down because the idle runner picked up the job).
Ah, interesting, I'll give it a try. The way I see it, since all of our pools have min-idle configured, the new runner that is spun up would take over the duty of the previous idle runner.
Worst case scenario, any additional runners will be removed once the scale-down routine runs.
Unfortunately I don't think I'll get a chance to test the scale sets, as I understand it's a GHES-only feature and our GHES server is fairly sensitive.
Scale sets work with github.com as well. It should be safe to test against GHES too; that's how ARC works. No worries if you can't test it out! It would have been nice to get some early feedback, but it's not mandatory in any way. I will probably mark it as "experimental" anyway in the first release.
You can see it in action here: 8e91a4e75c196b84.mp4
Ah, sorry, when you mentioned it worked with the GHES server side I assumed it was only for GitHub Enterprise Server and not GitHub Enterprise Cloud. If that's the case, I can test it in our GHEC instance on its experimental release.
That looks very promising. Side note: since we are talking about ARC, have you had anyone bring up the jobs-in-container feature? I was able to Frankenstein a solution for our kubernetes container-based runners that leverages this hook https://github.com/actions/runner-container-hooks to deploy a secondary pod (avoiding DinD), which enables jobs-in-container support.
It was inaccurate wording on my side. What I meant is that github have implemented the logic to handle runner scheduling in github itself. There are no more labels, just scale set names (which are used as the runs-on target). The autoscaler just subscribes to a message queue exposed via HTTP longpoll and reacts to those messages. A lot of the logic that exists in the GARM "pool manager", scale sets do by design before sending messages on the queue. There is no need to install any 3rd party binaries in GHES itself. Here is a presentation of ARC and scale sets (scale set APIs are not documented; I had to rely on ARC as docs):
Nope. I wonder if having docker available in the runners enables the ability to use this. I am not a fan of DinD. It's part of the reason I wrote GARM. Some things need VMs/bare metal/specialized hardware. If running some things in containers means compromising the security of the host because you need to run in privileged mode, you might as well run a VM and dispose of it after use.
It sounds like this also removes the necessity for publicly exposed webhook ingresses, which is a huge security consideration for us.
Yeah, our security team also didn't like the idea of DinD (I agreed), and it was a big factor in why I chose GARM over ARC. The runner hook I presented above replaces DinD with secondary pods, which are still under our pod security admission controllers. The only use case left for our VMs is building container images.
That is correct. Only the metadata endpoints are needed, which can be exposed only to your IaaS.
Ha! It looks like all you need is docker on the runner: https://github.com/gsamfira/garm-testing/actions/runs/14816010451/job/41596525545 (The error is due to the fact that it's running inside an LXD container, not a VM.) All I had to do was: garm-cli scaleset update 5 --extra-specs='{"extra_packages":["docker.io"]}'
Here is one run on a VM: https://github.com/gsamfira/garm-testing/actions/runs/14816067727/job/41596664596
Yeah, if you run it in a VM it works fine. I am not sure about LXD as I've never used it before, but if you try to run it in a Kubernetes container it will fail unless you add the hook and configure some environment variables such as:
You also need a work volume,
and an attached service account that has the necessary permissions.
Yup. LXD can create system containers and VMs.
That's pretty nice! I should write some blog posts (or something) about using GARM for various use cases.
Hello, we are trying to set up a garm instance connected to an internal GitHub Enterprise Server instance, and garm is unable to connect to the github instance due to an "unknown authority" TLS error:

I have tried everything I could think of:
- /usr/local/share/ca-certificates, then running update-ca-certificates.
- /etc/ssl/certs directly, once update-ca-certificates has run.
- /usr/share/ca-certificates, and updating /etc/ca-certificates.conf as advised by https://stackoverflow.com/a/74575551.
- ENV SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt and ENV SSL_CERT_DIR=/etc/ssl/certs, together and separately. I also tried setting the environment variables directly on the machine (export ...) and directly before executing the garm service (SSL_CERT_DIR=/etc/ssl/certs ./garm -config config.toml).

Runtime details:
Runtime Image: library/ubuntu:22.04
Additional Packages: gettext-base, wget, apt-transport-https, ca-certificates, gnupg, curl, jq, google-cloud-cli, kubectl, google-cloud-sdk-gke-gcloud-auth-plugin, yq
GARM Binary:
Custom garm binary based off v0.1.5.
Builder: Google Cloud Build.
Build Image: golang:1.24-bookworm
Build command:
Any ideas as to what might be causing this?
Could it be an incompatibility between the binary being built on Debian and run on Ubuntu?
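As an aside, the Debian-vs-Ubuntu build difference shouldn't matter for trust roots, since Go reads them at runtime from the running system; on Linux, Go's crypto/x509 also honors SSL_CERT_FILE and SSL_CERT_DIR when building the system pool. A quick way to narrow this down is a small probe like the sketch below, run with the same environment as the garm service (the hostname is a placeholder), to check whether the process-level trust store actually contains the internal CA, independent of garm itself:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

func main() {
	host := "ghes.example.com" // placeholder: your GHES hostname

	// SystemCertPool reads the OS trust store; on Linux it also honors the
	// SSL_CERT_FILE and SSL_CERT_DIR environment variables.
	pool, err := x509.SystemCertPool()
	if err != nil {
		fmt.Fprintln(os.Stderr, "loading system pool:", err)
		os.Exit(1)
	}

	conn, err := tls.Dial("tcp", host+":443", &tls.Config{RootCAs: pool})
	if err != nil {
		// An "x509: certificate signed by unknown authority" here means the
		// internal CA is not in the pool this process sees.
		fmt.Fprintln(os.Stderr, "handshake failed:", err)
		os.Exit(1)
	}
	defer conn.Close()
	fmt.Println("handshake OK; issuer:", conn.ConnectionState().PeerCertificates[0].Issuer)
}
```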