Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for Orleans example code: silo is not marked dead on graceful shutdown #2671

Closed
wants to merge 1 commit into from

Conversation

VincentH-Net
Copy link

The silo lifecycle was not tied to the host lifecycle, resulting in silo not marked dead when e.g. deploying a new revision to Azure Container Apps, which in turn results in startup errors in the new revision.

builder.Host.UseOrleans() is consistently used in other Orleans examples as a best practice

The silo lifecycle was not tied to the host lifecycle, resulting in silo not marked dead when e.g. deploying a new revision to Azure Container Apps, which in turn results in startup errors in the new revision. Host.UseOrleans() is consistently used in other Orleans examples as a best practice.
@davidfowl
Copy link
Member

Is this a bug in orleans @ReubenBond ?

@egil
Copy link

egil commented Feb 25, 2025

If builder.UseOrleans() option is wrong it shouldn't be there at all.

Don't require people to read the docs when the compiler can help.

@ReubenBond
Copy link
Member

ReubenBond commented Feb 25, 2025

It would be strange if this made a difference: if the silo wasn't shutting down alongside the host, then it also wouldn't start up alongside the host, since the mechanism is the same (IHostedService). To understand what's going on, we would need to look at the logs. ACA gives apps 30s to shutdown. If something was holding up the shutdown for longer than that (eg, grains not deactivating), it's possible that ACA will have terminated the host before it marked itself as dead in the membership table. Which version of Orleans are you using @VincentH-Net? I recommend upgrading to v9.1.2 since we made some fixes which will make recovery from these sorts of situations much faster (i.e, where membership says there are active silos but there are none because they were killed).

@VincentH-Net
Copy link
Author

@ReubenBond I'm using Orleans 8.2.0; I consistently saw that the previous revision silo was not marked dead after it was deactivated when deploying a new revision to ACA with Aspire.

Here are the logs containing shutdown across all revisions, as well as the revision history:

  • there are no shutdown logs before Feb 13
  • but there are many revisions deployed before that:
    image

And this is the commit that fixed it - on Feb 13:
image

I do plan to update to 9.1.2, however preventing is ofc better than recovering

@ReubenBond
Copy link
Member

@VincentH-Net the only difference which comes to mind is ordering: when you use the .Host version, the code will run after everything else. Maybe that is affecting shutdown - eg, because it affects the order that IHostedService implementations are shut down.

@VincentH-Net
Copy link
Author

@ReubenBond I finally had some time to create a clean repro per the Aspire docs and test it against Azure Container Apps and Azure Storage, and in both Orleans 8.2 and 9.1 the issue does not occur.

In my solution I had a HostedService that processed incoming MQTT messages in Orleans, which were continuously coming in at many messages per second:

builder.UseOrleans( ... );
builder.Services.AddHostedService<EnGatewayService>();  // uses Orleans client

In that situation, the issue did occur as I reported above, and I had to change the code to

builder.Host.UseOrleans( ... );
builder.Services.AddHostedService<EnGatewayService>();  // uses Orleans client

to fix it. Per your explanation this change causes Orleans to run after other code; however I'm not sure how this would cause a silo not to be marked dead on ACA container shutdown.
=> I'd love to understand - perhaps you can clarify?

For completeness sake, these were my clean repro steps, which confirm there is no issue then:

Build a clean solution:

  1. Create a new Aspire 9.1 empty project
  2. Add an Asp.Net Core minimal Api project as Orleans silo
  3. Follow the steps in .NET Aspire Orleans integration, only skip creating the Orleans client project

Verify issue:

  1. Deploy to Azure Container Apps and Azure Storage using Aspire as per Aspire doc
  2. In the Azure portal, open the created recource group and navigate to the listed Azure Storage resource
  3. In the Storage account, navigate to the Storage Browser and verify that the OrleansSiloInstances table exists and contains an Orleans silo instance with Status Active
  4. Navigate to the container app in the resource group and stop the app
  5. After the app has stopped, navigate to the Storage account and verify that all silos in the OrleansSiloInstances table have Status Dead

@ReubenBond
Copy link
Member

@VincentH-Net are you saying this issue reproduces on a clean solution with no IHostedService, or does it need a hosted service like the one in your example?

@VincentH-Net
Copy link
Author

VincentH-Net commented Mar 25, 2025

@ReubenBond The latter- issue only with a hosted service (that accesses Orleans nonstop during its entire lifecycle)

@ReubenBond
Copy link
Member

@VincentH-Net in that case, can you show me what your IHostedService code looks like?

@VincentH-Net
Copy link
Author

@VincentH-Net in that case, can you show me what your IHostedService code looks like?

@ReubenBond The issue occurred in customer code that I'm not free to share; I attempted to reproduce the issue today by extending above clean example to below code, but the issue did not occur there.

I did capture the lifecycle logging sequences in ACA to see if there was a difference in order with builder.UseOrleans(...) versus builder.Host.UseOrleans(...), but looks like it does not make a difference? (see below screenshots)

var builder = WebApplication.CreateBuilder(args);

builder.AddServiceDefaults();
builder.AddKeyedAzureTableClient("clustering");
builder.AddKeyedAzureBlobClient("grain-state");
builder.Host.UseOrleans(silo => { }); // compare with builder.UseOrleans()
builder.Services.AddHostedService<BusyOrleansAccessingService>();

var app = builder.Build();
app.Run();

sealed class BusyOrleansAccessingService(
    IClusterClient client,
    ILogger<BusyOrleansAccessingService> log
) : IHostedService, IDisposable
{
    Task? backgroundTask;
    CancellationTokenSource? stoppingCts;

    public Task StartAsync(CancellationToken cancellationToken)
    {
        log.LogInformation("Starting background task");
        stoppingCts = new CancellationTokenSource();
        backgroundTask = Task.Run(() => DoWork(stoppingCts.Token), cancellationToken);
        return Task.CompletedTask;
    }

    async Task DoWork(CancellationToken token)
    {
        for (int id = 0; !token.IsCancellationRequested; id = (id + 1) % 100)
        {
            var grain = client.GetGrain<ICounterGrain>(id.ToString());
            await grain.Increment();
            int counter = await grain.Get();
            log.LogInformation("Incremented counter {Counter} to {Value}", id, counter);
            await Task.Delay(TimeSpan.FromSeconds(0.1), token);
        }
    }

    public async Task StopAsync(CancellationToken cancellationToken)
    {
        log.LogInformation("Stopping background task");
        stoppingCts?.Cancel();
        if (backgroundTask is not null)
            await Task.WhenAny(backgroundTask, Task.Delay(Timeout.Infinite, cancellationToken));
    }

    public void Dispose() => stoppingCts?.Dispose();
}

[Alias("ICounterGrain")]
interface ICounterGrain : IGrainWithStringKey
{
    [Alias("Get")]       ValueTask<int> Get();
    [Alias("Increment")] ValueTask<int> Increment();
}

sealed class CounterGrain(
    [PersistentState("count")] IPersistentState<int> count) : ICounterGrain
{
    public ValueTask<int> Get() => ValueTask.FromResult(count.State);

    public async ValueTask<int> Increment()
    {
        var result = ++count.State;
        await count.WriteStateAsync();
        return result;
    }
}

Captured start / stop log comparison:
image

Conclusion

Even though the change I proposed in this PR did fix the issue in my customer's code, without a clean repro or a plausible explanation what the lifecycle difference (if any) between builder.UseOrleans and builder.Host.UseOrleans is, I see no reason to change the example in the docs (unless you can think of one?)

@VincentH-Net
Copy link
Author

Thanks for the support @ReubenBond! I gained learnings ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants