Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(runtime): compare the results from factory functions correctly #420

Merged
merged 32 commits into from
Jan 15, 2025

Conversation

ardatan
Copy link
Member

@ardatan ardatan commented Jan 8, 2025

Fixes #419

Leave the supergraph configuration handling logic to fusion-runtime package so it can compare bare read supergraph sdl directly inside unified graph manager to decide if the supergraph has changed.
Before, it was parsed by gateway runtime and printed sdls might be different on manager side, so it was wrongly assuming that supergraph has changed whenever the supergraph fetcher has given by the user as a factory function.

Another set of improvements;

  • In case of schema reload, throw SCHEMA_RELOAD error while recreating the transports and executors
  • In case of shut down, throw SHUTTING_DOWN error while cleaning the transports and executors up

Previously, these errors are only thrown for subscriptions, it wasn't thrown in other type of operations as well.
And previously the thrown errors during these two cleanup and restart process were cryptic, now the mentioned two errors above are thrown with more clear messages

No more timers
Instead of setTimeout, on each request as we do on initial request, we compare the last fetch time and the current time to see if the supergraph should be refetched. We use a shared promise here to prevent race condition.
This will relax the usage in serverless, because some serverless envs complain about timers.
It will also relax the event loop which won't be busy if there is no request in the background.
In case of clustering, it will also relax the supergraph source, because when there is setTimeout, all members of the cluster will send requests at the same time to the source(CDN etc). Now they will make those requests lazily.

@ardatan ardatan force-pushed the polling-disposal branch 3 times, most recently from 582b325 to 64e7a04 Compare January 8, 2025 12:09
@ardatan ardatan changed the title fix: dispose the executor only if the schema has changed Reproduction #419 Jan 8, 2025
@ardatan ardatan force-pushed the polling-disposal branch 2 times, most recently from 26f4a0a to 8d8c090 Compare January 8, 2025 21:41
@ardatan ardatan force-pushed the polling-disposal branch 2 times, most recently from a1f433b to c90f029 Compare January 9, 2025 13:09
@ardatan ardatan changed the title Reproduction #419 fix(runtime): compare the results from factory functions correctly Jan 9, 2025
@ardatan ardatan marked this pull request as ready for review January 9, 2025 13:10
@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (alpha)

The latest changes of this PR are available as alpha on npm (based on the declared changesets):

Package Version Info
@graphql-tools/executor-http 1.2.5-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-tools/federation 3.0.9-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/fusion-runtime 0.10.29-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-hive/gateway 1.7.9-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/plugin-opentelemetry 1.3.36-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/plugin-prometheus 1.3.24-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-hive/gateway-runtime 1.4.8-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/transport-common 0.7.27-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/transport-http 0.6.31-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/transport-http-callback 0.5.18-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎
@graphql-mesh/transport-ws 0.4.16-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413 npm ↗︎ unpkg ↗︎

@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (Binary for Linux-X64)

The latest changes of this PR are available for download (based on the declared changesets).

Download

@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (Binary for macOS-ARM64)

The latest changes of this PR are available for download (based on the declared changesets).

Download

@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (Bun Docker Image)

The latest changes of this PR are available as image on GitHub Container Registry (based on the declared changesets):

ghcr.io/graphql-hive/gateway:1.7.9-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413-bun

@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (Node Docker Image)

The latest changes of this PR are available as image on GitHub Container Registry (based on the declared changesets):

ghcr.io/graphql-hive/gateway:1.7.9-alpha-2f4a1173f1280a85a6f87d9c1ab2db7e2a48d413

@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (Binary for macOS-X64)

The latest changes of this PR are available for download (based on the declared changesets).

Download

@theguild-bot
Copy link
Collaborator

theguild-bot commented Jan 10, 2025

🚀 Snapshot Release (Binary for Windows-X64)

The latest changes of this PR are available for download (based on the declared changesets).

Download

Copy link
Member

@enisdenjo enisdenjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from one small comment, LGTM.

@fauna5
Copy link

fauna5 commented Jan 10, 2025

Hey - i pulled the snapshot version 1.7.9-alpha-4221681019dfd2fd156d1f27570765f312fb6cd8 and the operation was not aborted. This fixes the issue we have for now. Thanks so much for getting on to this so quickly 🫶 This'll get us to a production deploy!

It still does abort when there is a schema change (this is much more rare, ~once/day). I guess it's a lot harder logic to allow the in-flight requests to complete on the old schema whilst validating new requests with the new schema. Keeping the old context around until all requests have completed.

However as a client, when that request is aborted we don't know if the downstream request passed or failed. We can add retry logic on the specific abort error. If the failed request is a non-idempotent mutation we'd need to investigate via extra reads and logic to see if the mutation is successful. Whilst it's a tiny edge-case that we will probably not see, it still is there.

One way around this is recycling our k8s pods on schema update, but that defies the point of the pollingInterval

@ardatan ardatan marked this pull request as draft January 11, 2025 15:29
@ardatan ardatan marked this pull request as ready for review January 11, 2025 23:14
@ardatan
Copy link
Member Author

ardatan commented Jan 11, 2025

@fauna5 If you try the new alphas, the errors will be more clear in case of schema reload. With 503 status code and SCHEMA_RELOAD code inside the error, you will know that the gateway is reloading the schema. During the schema reload, any ongoing HTTP calls will be aborted and the execution will immediately be stopped to prevent leaks.
I think blocking the schema reload for those ongoing calls can be too much. Like you said, it is not that often, and can be solved with a retry logic.

@ardatan ardatan requested a review from enisdenjo January 11, 2025 23:22
@ardatan ardatan merged commit 14152f7 into main Jan 15, 2025
31 checks passed
@ardatan ardatan deleted the polling-disposal branch January 15, 2025 12:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Downstream requests aborted when refreshing schema
4 participants