
Refactor S3 retries to be in alignment with the AWS SDK.#5747

Open
teo-tsirpanis wants to merge 7 commits into main from teo/core-495-improve-our-slow_down-retry-strategy

@teo-tsirpanis
Member

@teo-tsirpanis teo-tsirpanis commented Feb 9, 2026

This PR updates the retry logic of the S3 VFS to align with the AWS SDK. It adds a new config option, vfs.s3.retry_strategy, to set the retry strategy (defaulting to standard), and changes the default value of vfs.s3.connect_max_tries to an empty string, which lets the AWS SDK pick the value (from environment variables or profile configuration).
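A minimal sketch of the "empty means unset" handling for vfs.s3.connect_max_tries, assuming the config value arrives as a string. The function name parse_connect_max_tries is hypothetical and is not TileDB's actual parsing code; an empty value maps to std::nullopt so the caller can defer to the AWS SDK.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not TileDB's real code): turns the raw string value of
// vfs.s3.connect_max_tries into an optional integer.
std::optional<int64_t> parse_connect_max_tries(const std::string& value) {
  if (value.empty()) {
    // Unset: let the AWS SDK resolve max attempts (env vars / profile).
    return std::nullopt;
  }
  std::size_t pos = 0;
  // std::stoll throws std::invalid_argument for non-numeric input.
  const long long parsed = std::stoll(value, &pos);
  if (pos != value.size() || parsed < 0) {
    // Reject trailing garbage such as "5x" and negative counts.
    throw std::invalid_argument("invalid vfs.s3.connect_max_tries: " + value);
  }
  return static_cast<int64_t>(parsed);
}
```

A caller would then pick the SDK overload based on has_value(), which is the branching shown later in the review.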

Because the AWS SDK's retry strategies don't support setting a scale factor, the vfs.s3.connect_scale_factor config option was deprecated and is no longer supported.
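To show what kind of knob is being dropped, here is a hypothetical illustration of a scale-factor-based exponential backoff, where the scale factor multiplies a delay that doubles on each retry. The function backoff_delay_ms and its doubling formula are assumptions for illustration only, not the AWS SDK's or TileDB's implementation; the SDK's own strategies compute their delays internally and expose no scale factor.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical: delay grows as scale_factor_ms * 2^attempted_retries.
int64_t backoff_delay_ms(int64_t scale_factor_ms, int64_t attempted_retries) {
  if (attempted_retries <= 0) {
    return 0;  // no delay before the first attempt
  }
  // Cap the shift so the left shift cannot overflow int64_t.
  const int64_t shift = std::min<int64_t>(attempted_retries, 20);
  return scale_factor_ms * (int64_t{1} << shift);
}
```

With a scale factor of 25 ms this gives 50, 100, 200 ms for retries 1 to 3; once delay calculation is delegated to the SDK, there is no place to plug such a factor in.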


TYPE: FEATURE
DESC: Added vfs.s3.retry_strategy that allows customizing the AWS SDK retry strategy to use for requests to S3. Defaults to standard.

TYPE: BREAKING_BEHAVIOR
DESC: The vfs.s3.connect_scale_factor config option is no longer supported.

TYPE: BREAKING_BEHAVIOR
DESC: The vfs.s3.connect_max_tries config option defaults to an empty string, which lets the AWS SDK determine the maximum number of retries.

@teo-tsirpanis teo-tsirpanis requested a review from ypatia February 9, 2026 15:33
@ypatia ypatia requested a review from ihnorton February 12, 2026 10:53
@ypatia
Member

ypatia commented Feb 12, 2026

LGTM but I'd like @ihnorton to confirm we're ok to do this right now.

@teo-tsirpanis teo-tsirpanis requested a review from a team as a code owner February 19, 2026 01:47
@teo-tsirpanis teo-tsirpanis force-pushed the teo/core-495-improve-our-slow_down-retry-strategy branch from 410ffd0 to b6cbe16 Compare February 19, 2026 02:00
@teo-tsirpanis teo-tsirpanis force-pushed the teo/core-495-improve-our-slow_down-retry-strategy branch from b6cbe16 to e57322f Compare February 19, 2026 02:02
@teo-tsirpanis teo-tsirpanis changed the title from "Always use exponential backoff to retry S3 requests." to "Refactor S3 retries to be in alignment with the AWS SDK." Feb 19, 2026
Member

@ypatia ypatia left a comment

Let's also add a description, as this is a behavior change that can have an impact.

```cpp
const std::string Config::VFS_S3_CA_FILE = "";
const std::string Config::VFS_S3_CA_PATH = "";
const std::string Config::VFS_S3_CONNECT_TIMEOUT_MS = "10800";
const std::string Config::VFS_S3_CONNECT_MAX_TRIES = "5";
```
Member

Why remove the default for max tries rather than setting it to 10?

Member Author

I removed it in order to give the AWS SDK the opportunity to determine the max tries through environment variables and profile settings.

Comment on lines +1423 to +1428
```cpp
if (s3_params_.connect_max_tries_.has_value()) {
  retry_strategy = Aws::Client::InitRetryStrategy(
      s3_params_.connect_max_tries_.value(), s3_params_.retry_strategy_);
} else {
  retry_strategy = Aws::Client::InitRetryStrategy(s3_params_.retry_strategy_);
}
```
Member

Suggested change
```diff
-if (s3_params_.connect_max_tries_.has_value()) {
-  retry_strategy = Aws::Client::InitRetryStrategy(
-      s3_params_.connect_max_tries_.value(), s3_params_.retry_strategy_);
-} else {
-  retry_strategy = Aws::Client::InitRetryStrategy(s3_params_.retry_strategy_);
-}
+retry_strategy = Aws::Client::InitRetryStrategy(
+    s3_params_.connect_max_tries_.value_or(10), s3_params_.retry_strategy_);
```

Member Author

We can't do that; the single-parameter overload tries to retrieve the max retries from environment variables and profile configuration, which value_or(10) would bypass.
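A simplified, hypothetical model of why the suggested value_or(10) changes behavior. resolve_max_attempts stands in for the SDK's resolution logic: an explicit value always wins; otherwise it reads the AWS_MAX_ATTEMPTS environment variable and falls back to a default. Profile lookup (max_attempts) is omitted, and the fallback of 3 is an assumption chosen to match the commonly documented standard-mode default.

```cpp
#include <cstdint>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical model, not the AWS SDK's code.
int64_t resolve_max_attempts(std::optional<int64_t> configured,
                             int64_t fallback = 3) {
  if (configured.has_value()) {
    return *configured;  // explicit config: env and profile are ignored
  }
  if (const char* env = std::getenv("AWS_MAX_ATTEMPTS")) {
    try {
      return std::stoll(env);
    } catch (...) {
      // Malformed value: ignore it and use the fallback below.
    }
  }
  return fallback;  // profile lookup omitted in this sketch
}
```

Under this model, passing connect_max_tries_.value_or(10) always takes the first branch, so AWS_MAX_ATTEMPTS would never be consulted; passing the optional through (or choosing the overload) preserves the SDK's lookup.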

```cpp
    , connect_scale_factor_(config.get<int64_t>(
          "vfs.s3.connect_scale_factor", Config::must_find))
    , connect_max_tries_(config.get<int64_t>("vfs.s3.connect_max_tries"))
    , has_connect_scale_factor_(
```
Member

What about the retry_strategy_?

Member Author

Oops, added.


```cpp
virtual long CalculateDelayBeforeNextRetry(
    const Aws::Client::AWSError<Aws::Client::CoreErrors>& error,
    long attemptedRetries) const {
```
Member

override on the rest of the methods?

Member Author

Done.
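A sketch of what marking every overriding method with override buys. The types below (Error, RetryStrategyBase, SlowDownAwareStrategy) are hypothetical, simplified stand-ins that mirror the shape of the reviewed methods; the real code overrides methods taking Aws::Client::AWSError<Aws::Client::CoreErrors>. The retry limit of 5 and the 25 ms doubling delay are illustrative values only.

```cpp
#include <algorithm>
#include <string>

// Hypothetical stand-ins, not the AWS SDK types.
struct Error {
  std::string code;
};

class RetryStrategyBase {
 public:
  virtual ~RetryStrategyBase() = default;
  virtual bool ShouldRetry(const Error& error, long attemptedRetries) const = 0;
  virtual long CalculateDelayBeforeNextRetry(
      const Error& error, long attemptedRetries) const = 0;
};

class SlowDownAwareStrategy : public RetryStrategyBase {
 public:
  // With `override`, any signature drift from the base (e.g. a missing
  // `const`) becomes a compile error instead of a silent new overload.
  bool ShouldRetry(const Error& error, long attemptedRetries) const override {
    return attemptedRetries < 5 && error.code == "SlowDown";
  }

  long CalculateDelayBeforeNextRetry(
      const Error& /*error*/, long attemptedRetries) const override {
    if (attemptedRetries <= 0) {
      return 0;
    }
    // Doubling delay from a 25 ms base, with the shift capped.
    return 25L << std::min(attemptedRetries, 20L);
  }
};
```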

```cpp
} else if (param == "vfs.s3.connect_max_tries") {
RETURN_NOT_OK(utils::parse::convert(value, &vint64));
} else if (param == "vfs.s3.connect_scale_factor") {
} else if (param == "vfs.s3.connect_max_tries" && !value.empty()) {
```
Member

Maybe we can add a sanity check for the allowed values of retry_strategy as well?

Member Author

I prefer we don't, and pass the value to the AWS SDK as-is. It can handle invalid values.

```cpp
// SSOCredentialsProvider
"TooManyRequestsException"},
s3_params_.connect_max_tries_);
s3_params_.connect_max_tries_.value_or(10));
```
Member

Why not just rely on the parameter default?

Member Author

We can, but I thought of letting the config option influence retries in authentication requests as well.

@teo-tsirpanis
Member Author

Updated description.

@teo-tsirpanis teo-tsirpanis requested a review from ypatia February 20, 2026 16:04