GHES instance with internal certificate throwing "unknown authority" error #381

Open · Hdom opened this issue May 2, 2025 · 38 comments

Hdom commented May 2, 2025

Hello, we are trying to set up a GARM instance connected to an internal GitHub Enterprise Server instance, and GARM is unable to connect to the GitHub instance due to an "unknown authority" TLS error.

error:

time=2025-05-02T21:35:06.844Z level=INFO source=/workspace/runner/pool/pool.go:1599 msg="running initial tool update"
time=2025-05-02T21:35:06.894Z level=ERROR source=/workspace/runner/pool/pool.go:404 msg="failed to update tools for entity" error="Get \"https://github.internal/api/v3/repos/ORG/repo/actions/runners/downloads\": could not refresh installation id 2's token: could not get access_tokens from GitHub API for installation ID 2: tls: failed to verify certificate: x509: certificate signed by unknown authority\nfetching runner tools\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).FetchTools\n\t/workspace/runner/pool/pool.go:1999\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).updateTools\n\t/workspace/runner/pool/pool.go:402\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).Start.func1\n\t/workspace/runner/pool/pool.go:1600\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700" entity=UKG-FedRAMP/bst-garm-config
time=2025-05-02T21:35:06.894Z level=ERROR source=/workspace/runner/pool/pool.go:1601 msg="failed to update tools" error="failed to update tools for entity ORG/repo: fetching runner tools: Get \"https://github.internal/api/v3/repos/ORG/repo/actions/runners/downloads\": could not refresh installation id 2's token: could not get access_tokens from GitHub API for installation ID 2: tls: failed to verify certificate: x509: certificate signed by unknown authority"

I have tried everything I could think of:

  • Added Root CA cert, Intermediate CA cert, and host cert to /usr/local/share/ca-certificates and ran update-ca-certificates.
  • Verified that the certificates are present in /etc/ssl/certs once update-ca-certificates is run.
  • Verified I can curl the instance from within the machine running garm without any issues, certificate is trusted.
  • Tried adding the certs to /usr/share/ca-certificates and updating /etc/ca-certificates.conf as advised by this https://stackoverflow.com/a/74575551.
  • Tried setting ENV SSL_CERT_FILE=/etc/ssl/certs/ca-certificates.crt and ENV SSL_CERT_DIR=/etc/ssl/certs together and separately. I also tried setting the environment variables directly in the machine (export ...) and directly before executing the garm service (SSL_CERT_DIR=/etc/ssl/certs ./garm -config config.toml).
  • Tried running with and without root permissions.

Runtime details:
Runtime Image: library/ubuntu:22.04
Additional Packages: gettext-base, wget, apt-transport-https, ca-certificates, gnupg, curl, jq, google-cloud-cli, kubectl, google-cloud-sdk-gke-gcloud-auth-plugin, yq

GARM Binary:
Custom garm binary based off v0.1.5.
Builder: Google Cloud Build.
Build Image: golang:1.24-bookworm
Build command:

go build -o $_FILE_NAME \
      -tags osusergo,netgo,sqlite_omit_load_extension \
      -ldflags "-linkmode external -extldflags '-static' -s -w -X main.Version=$(git describe --tags --match='v[0-9]*' --dirty --always)" \
      ./cmd/garm

Any ideas as to what might be causing this?
Could it be an incompatibility between the binary being built on Debian and running on Ubuntu?


gabriel-samfira commented May 3, 2025

Hi @Hdom !

Try:

garm-cli github endpoint update --ca-cert-path /your/ghes/ca.pem ghes.example.com

See:

https://github.com/cloudbase/garm/blob/main/doc/using_garm.md#creating-a-github-endpoint

Edit:

The CA you specify here will be used to create the GitHub client and will be passed along to the runners as an additional CA, so that your runners can also communicate with GHES.


Hdom commented May 3, 2025

Hello @gabriel-samfira, I had made sure to add the bundle before and am still getting the same error. Here are the details:

  • Downloaded cert chain using </dev/null openssl s_client -connect github.internal:443 -showcerts | sed -n '/-----BEGIN/,/-----END/p' > bundle.pem
  • Updated endpoint garm-cli github endpoint update github.internal --ca-cert-path bundle.pem
  • There are 3 certs in the chain, so I tried each individual one separately too (a quick sanity check of the bundle is sketched below).
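
As a quick sanity check (not from the thread; it reuses the host and bundle path from the steps above), the verify return code reported by openssl shows whether the bundle actually validates the server's chain:

</dev/null openssl s_client -connect github.internal:443 -CAfile bundle.pem 2>/dev/null | grep "Verify return code"
# Expect "Verify return code: 0 (ok)". Note this only passes if bundle.pem
# contains the self-signed root, not just the leaf and intermediate certs.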

@gabriel-samfira

Out of curiosity, could you try using a PAT instead of an app just to see if the error goes away?

This is where we compose the transport for the github client:

garm/params/params.go

Lines 635 to 687 in 0d53dce

func (g GithubCredentials) GetHTTPClient(ctx context.Context) (*http.Client, error) {
	var roots *x509.CertPool
	if g.CABundle != nil {
		roots = x509.NewCertPool()
		ok := roots.AppendCertsFromPEM(g.CABundle)
		if !ok {
			return nil, fmt.Errorf("failed to parse CA cert")
		}
	}
	httpTransport := &http.Transport{
		TLSClientConfig: &tls.Config{
			RootCAs:    roots,
			MinVersion: tls.VersionTLS12,
		},
	}
	var tc *http.Client
	switch g.AuthType {
	case GithubAuthTypeApp:
		var app GithubApp
		if err := json.Unmarshal(g.CredentialsPayload, &app); err != nil {
			return nil, fmt.Errorf("failed to unmarshal github app credentials: %w", err)
		}
		if app.AppID == 0 || app.InstallationID == 0 || len(app.PrivateKeyBytes) == 0 {
			return nil, fmt.Errorf("github app credentials are missing required fields")
		}
		itr, err := ghinstallation.New(httpTransport, app.AppID, app.InstallationID, app.PrivateKeyBytes)
		if err != nil {
			return nil, fmt.Errorf("failed to create github app installation transport: %w", err)
		}
		tc = &http.Client{Transport: itr}
	default:
		var pat GithubPAT
		if err := json.Unmarshal(g.CredentialsPayload, &pat); err != nil {
			return nil, fmt.Errorf("failed to unmarshal github app credentials: %w", err)
		}
		httpClient := &http.Client{Transport: httpTransport}
		ctx = context.WithValue(ctx, oauth2.HTTPClient, httpClient)
		if pat.OAuth2Token == "" {
			return nil, fmt.Errorf("github credentials are missing the OAuth2 token")
		}
		ts := oauth2.StaticTokenSource(
			&oauth2.Token{AccessToken: pat.OAuth2Token},
		)
		tc = oauth2.NewClient(ctx, ts)
	}
	return tc, nil
}

Apps and PATs have different code paths. For apps we leverage another package and I want to verify if it's because I don't properly pass in the CA certs to that package or if the issue is in the way we create the transport and pass it along to the github client.

Sadly, I do not have a GHES to test against, so I may need to bug you to verify potential fixes.
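
For reference, a minimal sketch of the app code path when it works against GHES (illustrative only, not GARM's actual code; buildAppClient and apiBaseURL are made-up names): ghinstallation uses the round tripper it is given for the access_tokens call as well, and for GHES its BaseURL has to point at the instance's /api/v3 base rather than the default https://api.github.com.

package ghesapp

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"
	"strings"

	"github.com/bradleyfalzon/ghinstallation/v2"
)

// buildAppClient is a hypothetical helper: it wires a CA-aware transport into
// ghinstallation and points token refreshes at the GHES API base instead of
// the default https://api.github.com.
func buildAppClient(caBundle []byte, apiBaseURL string, appID, installationID int64, privateKey []byte) (*http.Client, error) {
	roots := x509.NewCertPool()
	if ok := roots.AppendCertsFromPEM(caBundle); !ok {
		return nil, fmt.Errorf("failed to parse CA cert")
	}
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			RootCAs:    roots,
			MinVersion: tls.VersionTLS12,
		},
	}
	// ghinstallation routes its access_tokens requests through the supplied
	// round tripper, so the internal CA is trusted for token refreshes too.
	itr, err := ghinstallation.New(transport, appID, installationID, privateKey)
	if err != nil {
		return nil, fmt.Errorf("failed to create installation transport: %w", err)
	}
	// For GHES the token endpoint lives under <host>/api/v3, not the default
	// https://api.github.com, so the transport's BaseURL must be overridden.
	itr.BaseURL = strings.TrimSuffix(apiBaseURL, "/")
	return &http.Client{Transport: itr}, nil
}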

@gabriel-samfira

I think I may know where the issue is. Will propose a PR soon that you can test out.

@gabriel-samfira

But do try out PATs in the meantime.


Hdom commented May 3, 2025

@gabriel-samfira I apologize for the delay. I switched to a PAT and it was able to update tools:

time=2025-05-03T21:29:56.760Z level=INFO source=garm/runner/pool/pool.go:1599 msg="running initial tool update"
time=2025-05-03T21:29:57.018Z level=DEBUG source=garm/runner/pool/pool.go:413 msg="successfully updated tools" pool_mgr=ORG/repo pool_type=repository

@gabriel-samfira

Okay. Then I need to focus on the ghinstallation bit and figure out how to properly pass in the transport that includes the root CA pool.


Hdom commented May 3, 2025

I am available to verify potential fixes, just let me know

@gabriel-samfira

Could you try out:


Hdom commented May 3, 2025

It stopped throwing the unknown authority error, which I see as progress, but I'm getting a new error which I think is related to redirection:

time=2025-05-03T22:04:16.048Z level=ERROR source=garm/runner/pool/pool.go:404 msg="failed to update tools for entity" error="Get \"https://github.internal/api/v3/repos/ORG/repo/actions/runners/downloads\": could not refresh installation id 2's token: received non 2xx response status \"302 Found\" when fetching https://github.internal/app/installations/2/access_tokens\nfetching runner tools\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).FetchTools\n\tgarm/runner/pool/pool.go:1999\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).updateTools\n\tgarm/runner/pool/pool.go:402\ngithub.com/cloudbase/garm/runner/pool.(*basePoolManager).Start.func1\n\tgarm/runner/pool/pool.go:1600\nruntime.goexit\n\t/home/dennyso/go/pkg/mod/golang.org/[email protected]/src/runtime/asm_amd64.s:1695" entity=ORG/repo
time=2025-05-03T22:04:16.048Z level=ERROR source=garm/runner/pool/pool.go:1601 msg="failed to update tools" error="failed to update tools for entity ORG/repo: fetching runner tools: Get \"https://github.internal/api/v3/repos/ORG/repo/actions/runners/downloads\": could not refresh installation id 2's token: received non 2xx response status \"302 Found\" when fetching https://github.internal/app/installations/2/access_tokens"


Hdom commented May 3, 2025

I believe APIBaseURL is one of the values configured in the GitHub endpoint; mine currently looks like this:

+----------------+------------------------------------------------------------------+
| FIELD          | VALUE                                                            |
+----------------+------------------------------------------------------------------+
| Name           | github.internal                                                  |
| Base URL       | https://github.internal                                          |
| Upload URL     | https://github.internal                                          |
| API Base URL   | https://github.internal                                          |

I'm going to try updating that value to include the /api/v3 path.

@gabriel-samfira

The go-github client should do that automatically.


Hdom commented May 3, 2025

The go-github client should do that automatically.

That may be so, but it did fix the issue; tools are updating successfully now. I'm not sure what other unintended consequences it might bring.

@gabriel-samfira

There shouldn't be any. The URLs are only used in the GitHub client. The BaseURL should be left as https://github.internal if that is the URL used to access the UI; we use that to determine the origin endpoint of the jobs. The API Base URL and Upload URL are used in the client and apparently should have the /api/v3 URI. Could you also please test that PATs still work after the change?
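
For reference, the adjustment discussed here would presumably look something like the following (assuming garm-cli github endpoint update accepts the same --api-base-url and --upload-url flags documented for endpoint create, and that /api/uploads is the right upload path for this GHES; both are assumptions to verify):

garm-cli github endpoint update github.internal \
    --api-base-url https://github.internal/api/v3 \
    --upload-url https://github.internal/api/uploads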

@gabriel-samfira

Ahh. I see why it didn't automatically add the URI:
https://github.com/google/go-github/blob/c9e1ad0d3b6526b0443a5eeaf38d2ff0c3e46d8e/github/github.go#L381-L399

It also looks for api. in the host. And yours doesn't have api.github.internal.


gabriel-samfira commented May 3, 2025

Yeah, I need to update the docs to explicitly say we need to add those paths.

I merged:

Should get you up and running.

@gabriel-samfira

apropos, if you're feeling brave or just bored, I am working on adding scale sets to GARM:

If you have a recent version of GHES and would like to test them out, I would appreciate it! The feature is mostly done. Some rough edges may exist, but for the most part it should work. Just make sure to use unique names for scale sets that are not the same as any labels your pools may have.


Hdom commented May 3, 2025

How are scalesets different from pools?

Also, I am encountering a 500 error on /api/v1/metadata/runner-registration-token/ which is not outputting any logs on the GARM side or on the runner side.

I believe the obfuscation is coming from the fact that the API returns the error in the response content
https://github.com/cloudbase/garm/blob/main/apiserver/controllers/metadata.go#L33
but the calling logic never exposes the failure reason
https://github.com/cloudbase/garm-provider-common/blob/main/cloudconfig/templates.go#L141

I added a slog right before handleError to get a better picture of what is going on; it turns out it's 403 Resource not accessible by integration. Now to figure out what permission I might be missing.

Edit: Turns out I was missing Repo Admin (write).

@gabriel-samfira

How are scalesets different from pools?

Scale sets are essentially pools, but GHES server side. They allow the auto scaler to long-poll for jobs, specify max jobs, and do not require webhooks to work. Once the auto scaler gets a jobs message, it can take responsibility for those jobs and spin up workers.

They allow for simpler management of runners and, in theory, are more reliable in terms of letting the auto scaler know that jobs are available. Webhooks (from what GitHub says) are not reliable and may not be fired. You also never know if you have a job waiting if your autoscaler was down when the webhook was generated.

Also, I am encountering a 500 error on /api/v1/metadata/runner-registration-token/ which is not outputting any logs on the GARM side or on the runner side.

I really need to switch everything from the "github.com/pkg/errors" package to the standard package. The mix between errors.Wrap and fmt.Errorf is annoying.


Hdom commented May 3, 2025

You also never know if you have a job waiting if your autoscaler was down when the webhook was generated.

Yeah, it forces us to keep idle runners available to handle that possibility, as the idle scaling system ensures missed webhooks are taken care of.

Have you had the opportunity to do any sort of performance testing at a large scale? I did some tests a while back and experienced a strange bottleneck, both in GitHub webhook reception and in GARM processing, that limits the "rpm" to around 200 runners per minute. My findings showed that splitting the performance test across multiple repos reduced the GitHub-side bottleneck but increased the GARM-side bottleneck.

When using 1 repo and 1600 runners:

  • There was a 33-minute delay in receiving the last workflow_job "queued" webhook notification.
  • There was no noticeable delay on spinning up a runner after the notification was received.

When using 8 repos and 1600 runners:

  • The last workflow_job "queued" webhook notification was received within 5 minutes.
  • There was a 21-minute delay in GARM spinning up the last runner.


gabriel-samfira commented May 3, 2025

I have not stressed it that hard, but I suspect that you'd hit API rate limits pretty quickly with 8x1600 runners. I am guessing you also raised max runners. Keep in mind that each provider op means executing the provider binary. You might want to check load average on your garm server.

I would love to get some metrics from you if you ever run another stress test. Perhaps some logs as well.

If you're really bored, try to stress test the scaleset branch. I am curious how that fares.

@gabriel-samfira

ahh. It was 200x8. My bad. Interesting. What is the value of "Minimum Job Age Backoff" in your controller settings?

garm-cli controller show


Hdom commented May 3, 2025

What is the value of "Minimum Job Age Backoff" in your controller settings?

30

I would love to get some metrics from you if you ever run another stress test. Perhaps some logs as well.

I am hoping to get another chance eventually, but I will need to do it in an isolated environment this time, since it was very hard to see what was happening with all of the non-test-specific noise.

On the subject of API limits: when it comes to generating requests, I used 16 load workflows, each spinning up 100 "test" workflows. Since GitHub generates a separate token for each workflow, they don't interfere with each other's primary rate limits and would only interact on secondary rate limits.

It was 200x8.

Yes, for the first test I did all 1600 runners in a single repo; the second test was 200 each in 8 repos.


gabriel-samfira commented May 3, 2025

set the backoff to 0 and see if that improves things:

garm-cli controller update --minimum-job-age-backoff=0

That backoff waits for 30 seconds from the time a job arrives to the time the pool reacts to the job. The reasoning was to allow any idle runners to pick the job up, reducing runner churn (not spinning up a new runner only to tear it down because the idle runner picked up the job).


Hdom commented May 3, 2025

Ah, interesting, I'll give it a try. The way I see it, since all of our pools have min-idle configured, the new runner that is spun up would take over the duty of the previous idle runner.

@gabriel-samfira

Worst case scenario, any additional runners will be removed once the scale-down routine runs.


Hdom commented May 3, 2025

Unfortunately, I don't think I'll get a chance to test the scalesets, as I understand it's a GHES-only feature and our GHES server is fairly sensitive.


gabriel-samfira commented May 4, 2025

Scalesets work with github.com as well, and it should be safe to test against GHES; that's how ARC works. No worries if you can't test it out! It would have been nice to get some early feedback, but it's not mandatory in any way.

I will probably mark it as "experimental" anyway in the first release.

@gabriel-samfira

You can see it in action here:

8e91a4e75c196b84.mp4


Hdom commented May 4, 2025

Ah, sorry, when you mentioned it worked with the GHES server side I assumed it was only for GitHub Enterprise Server and not GitHub Enterprise Cloud. If that's the case, I can test it in our GHEC instance on its experimental release.


Hdom commented May 4, 2025

That looks very promising.

Side note, since we are talking about ARC, have you had anyone bring up the jobs-in-container feature?
https://docs.github.com/en/actions/writing-workflows/choosing-where-your-workflow-runs/running-jobs-in-a-container

I was able to Frankenstein a solution for our Kubernetes container-based runners that leverages this hook, https://github.com/actions/runner-container-hooks, to deploy a secondary pod (avoiding DinD), which enables jobs-in-container support.


gabriel-samfira commented May 4, 2025

Ah, sorry, when you mentioned it worked with the GHES server side I assumed it was only for GitHub Enterprise Server and not GitHub Enterprise Cloud. If that's the case, I can test it in our GHEC instance on its experimental release.

It was inaccurate wording on my side. What I meant is that GitHub has implemented the logic to handle runner scheduling in GitHub itself. There are no more labels, just scaleset names (which are used as runs-on in workflows) and runner groups.

The autoscaler just subscribes to a message queue exposed via HTTP long-poll and reacts to those messages. A lot of the logic that exists in the GARM "pool manager" is handled by scalesets, by design, before messages are sent on the queue.

There is no need to install any 3rd party binaries in GHES itself.

Here is a presentation of ARC and scale sets (the scale set APIs are not documented; I had to rely on ARC as docs):

https://github.com/actions/actions-runner-controller/blob/master/docs/gha-runner-scale-set-controller/README.md#how-it-works

Side note, since we are talking about ARC, have you had anyone bring up the jobs-in-container feature?

Nope. I wonder if having docker available in the runners enables the ability to use this. I am not a fan of DinD; it's part of the reason I wrote GARM. Some things need VMs/bare metal/specialized hardware. If running some things in containers means compromising the security of the host because you need to run in privileged mode, you might as well run a VM and dispose of it after use.


Hdom commented May 4, 2025

The autoscaler just subscribes to a message queue exposed via HTTP long-poll and reacts to those messages. A lot of the logic that exists in the GARM "pool manager" is handled by scalesets, by design, before messages are sent on the queue.

It sounds like this also removes the need for publicly exposed webhook ingresses, which is a huge security consideration for us.

Nope. I wonder if having docker available in the runners enables the ability to use this. I am not a fan of DinD; it's part of the reason I wrote GARM. Some things need VMs/bare metal/specialized hardware. If running some things in containers means compromising the security of the host because you need to run in privileged mode, you might as well run a VM and dispose of it after use.

Yeah, our security team also didn't like the idea of DinD (I agreed), and it was a big factor in why I chose GARM over ARC. The runner hook I mentioned above replaces DinD with secondary pods, which are still under our pod security admission controllers. The only use case left for our VMs is building container images.


gabriel-samfira commented May 4, 2025

It sounds like this also removes the need for publicly exposed webhook ingresses, which is a huge security consideration for us.

That is correct. Only the metadata endpoints are needed, which can be exposed only to your IaaS.

Yeah, our security team also didn't like the idea of DinD (I agreed), and it was a big factor in why I chose GARM over ARC

Ha! It looks like all you need is docker on the runner: https://github.com/gsamfira/garm-testing/actions/runs/14816010451/job/41596525545

(The error is due to the fact that it's running inside an LXD container, not a VM.)

All I had to do was:

garm-cli scaleset update 5 --extra-specs='{"extra_packages":["docker.io"]}'



Hdom commented May 4, 2025

Yeah, if you run it in a VM it works fine. I am not sure about LXD, as I've never used it before, but if you try to run it in a Kubernetes container it will fail unless you add the hook and configure some environment variables such as:

          - name: ACTIONS_RUNNER_CONTAINER_HOOKS
            value: /home/runner/k8s/index.js
          - name: ACTIONS_RUNNER_REQUIRE_JOB_CONTAINER
            value: "false"
          - name: ACTIONS_RUNNER_CONTAINER_HOOK_TEMPLATE
            value: /home/runner/k8s/job-template.yaml
          - name: ACTIONS_RUNNER_KUBERNETES_NAMESPACE
            value: namespace

You also need a work volume:

      - name: work
        ephemeral:
          volumeClaimTemplate:
            metadata:
              creationTimestamp: null
            spec:
              accessModes:
              - ReadWriteOnce
              resources:
                requests:
                  storage: 20Gi
              storageClassName: standard-rwo
              volumeMode: Filesystem

and an attached service account that has the necessary permissions; a rough sketch of the kind of Role that might be involved is included below.
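
For orientation (not from the thread; names are illustrative), the namespaced Role for actions/runner-container-hooks in Kubernetes mode typically looks roughly like this; verify the exact rules against the hook's or ARC's documentation for your version:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: runner-container-hooks   # illustrative name
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "create"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["get", "list", "create", "delete"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "delete"]

A RoleBinding in the same namespace then ties this Role to the service account attached to the runner pod.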

@gabriel-samfira

Yeah, if you run it in a VM it works fine. I am not sure about LXD, as I've never used it before

Yup. LXD can create system containers and VMs.

but if you try to run it in a Kubernetes container it will fail unless you add the hook and configure some environment variables such as:

That's pretty nice! I should write some blog posts (or something) about using GARM for various use cases.
