-
Notifications
You must be signed in to change notification settings - Fork 142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for Orleans example code: silo is not marked dead on graceful shutdown #2671
Conversation
The silo lifecycle was not tied to the host lifecycle, resulting in silo not marked dead when e.g. deploying a new revision to Azure Container Apps, which in turn results in startup errors in the new revision. Host.UseOrleans() is consistently used in other Orleans examples as a best practice.
Is this a bug in orleans @ReubenBond ? |
If builder.UseOrleans() option is wrong it shouldn't be there at all. Don't require people to read the docs when the compiler can help. |
It would be strange if this made a difference: if the silo wasn't shutting down alongside the host, then it also wouldn't start up alongside the host, since the mechanism is the same ( |
@ReubenBond I'm using Orleans 8.2.0; I consistently saw that the previous revision silo was not marked dead after it was deactivated when deploying a new revision to ACA with Aspire. Here are the logs containing shutdown across all revisions, as well as the revision history: And this is the commit that fixed it - on Feb 13: I do plan to update to 9.1.2, however preventing is ofc better than recovering |
@VincentH-Net the only difference which comes to mind is ordering: when you use the |
@ReubenBond I finally had some time to create a clean repro per the Aspire docs and test it against Azure Container Apps and Azure Storage, and in both Orleans 8.2 and 9.1 the issue does not occur. In my solution I had a HostedService that processed incoming MQTT messages in Orleans, which were continuously coming in at many messages per second: builder.UseOrleans( ... );
builder.Services.AddHostedService<EnGatewayService>(); // uses Orleans client In that situation, the issue did occur as I reported above, and I had to change the code to builder.Host.UseOrleans( ... );
builder.Services.AddHostedService<EnGatewayService>(); // uses Orleans client to fix it. Per your explanation this change causes Orleans to run after other code; however I'm not sure how this would cause a silo not to be marked dead on ACA container shutdown. For completeness sake, these were my clean repro steps, which confirm there is no issue then: Build a clean solution:
Verify issue:
|
@VincentH-Net are you saying this issue reproduces on a clean solution with no IHostedService, or does it need a hosted service like the one in your example? |
@ReubenBond The latter- issue only with a hosted service (that accesses Orleans nonstop during its entire lifecycle) |
@VincentH-Net in that case, can you show me what your |
@ReubenBond The issue occurred in customer code that I'm not free to share; I attempted to reproduce the issue today by extending above clean example to below code, but the issue did not occur there. I did capture the lifecycle logging sequences in ACA to see if there was a difference in order with var builder = WebApplication.CreateBuilder(args);
builder.AddServiceDefaults();
builder.AddKeyedAzureTableClient("clustering");
builder.AddKeyedAzureBlobClient("grain-state");
builder.Host.UseOrleans(silo => { }); // compare with builder.UseOrleans()
builder.Services.AddHostedService<BusyOrleansAccessingService>();
var app = builder.Build();
app.Run();
sealed class BusyOrleansAccessingService(
IClusterClient client,
ILogger<BusyOrleansAccessingService> log
) : IHostedService, IDisposable
{
Task? backgroundTask;
CancellationTokenSource? stoppingCts;
public Task StartAsync(CancellationToken cancellationToken)
{
log.LogInformation("Starting background task");
stoppingCts = new CancellationTokenSource();
backgroundTask = Task.Run(() => DoWork(stoppingCts.Token), cancellationToken);
return Task.CompletedTask;
}
async Task DoWork(CancellationToken token)
{
for (int id = 0; !token.IsCancellationRequested; id = (id + 1) % 100)
{
var grain = client.GetGrain<ICounterGrain>(id.ToString());
await grain.Increment();
int counter = await grain.Get();
log.LogInformation("Incremented counter {Counter} to {Value}", id, counter);
await Task.Delay(TimeSpan.FromSeconds(0.1), token);
}
}
public async Task StopAsync(CancellationToken cancellationToken)
{
log.LogInformation("Stopping background task");
stoppingCts?.Cancel();
if (backgroundTask is not null)
await Task.WhenAny(backgroundTask, Task.Delay(Timeout.Infinite, cancellationToken));
}
public void Dispose() => stoppingCts?.Dispose();
}
[Alias("ICounterGrain")]
interface ICounterGrain : IGrainWithStringKey
{
[Alias("Get")] ValueTask<int> Get();
[Alias("Increment")] ValueTask<int> Increment();
}
sealed class CounterGrain(
[PersistentState("count")] IPersistentState<int> count) : ICounterGrain
{
public ValueTask<int> Get() => ValueTask.FromResult(count.State);
public async ValueTask<int> Increment()
{
var result = ++count.State;
await count.WriteStateAsync();
return result;
}
} Captured start / stop log comparison: ConclusionEven though the change I proposed in this PR did fix the issue in my customer's code, without a clean repro or a plausible explanation what the lifecycle difference (if any) between |
Thanks for the support @ReubenBond! I gained learnings ;-) |
The silo lifecycle was not tied to the host lifecycle, resulting in silo not marked dead when e.g. deploying a new revision to Azure Container Apps, which in turn results in startup errors in the new revision.
builder.Host.UseOrleans()
is consistently used in other Orleans examples as a best practice