Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests define custom cluster replica sizes #30459

Merged
merged 8 commits into from
Nov 18, 2024

Conversation

antiguru
Copy link
Member

@antiguru antiguru commented Nov 13, 2024

Tests define their cluster replica sizes without relying on the builtin defaults.

Motivation

This allows us to decouple the builtin sizes in the materialize binary to differ from what tests require. It's a step towards a better emulator experience.

This PR turned out bigger than expected because we'd use the builtin defaults in places where we shouldn't. Tracing the places is hard, often because the place where the defaults are used is separated from where we encounter potential inconsistencies. To avoid the issues going forward, this PR changes the way we handle default cluster replica sizes: It's not an optional parameter anymore but all callers need to provide it to environmentd and testdrive. Only sqllogictest still uses the default.

The PR adjusts all entrypoints within the materialize repo to supply a cluster replica size map, and all uses outside the repo should do so already.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • If this PR includes major user-facing behavior changes, I have pinged the relevant PM to schedule a changelog post.

@antiguru antiguru requested review from a team as code owners November 13, 2024 13:49
Copy link
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nightly run triggered, I'd suggest waiting for it: https://buildkite.com/materialize/nightly/builds/10370

@antiguru antiguru force-pushed the test_cluster_replica_sizes branch 2 times, most recently from d7128b3 to 084b028 Compare November 14, 2024 09:17
@antiguru antiguru requested a review from a team as a code owner November 14, 2024 09:17
@antiguru
Copy link
Member Author

I've capitulated trying to work with both the default value and the value we'd like to use in tests. There were too many places that would rely on the default value and hence cause hard-to-debug test failures. Some code constructs a default replica size map and then it'll get passed around a bunch, introducing a gap between introducing the problem and an error actually happening.

Instead, I think we should never rely on the default encoded in environmentd, and always pass a replica size map. This makes it significantly cleaner as to what values are expected, but it causes a bunch of code changes. I updated the PR to that effect, let's see what tests say.

@def-
Copy link
Contributor

def- commented Nov 14, 2024

If you want I can also take over cleaning up the test failures.

@antiguru antiguru force-pushed the test_cluster_replica_sizes branch 6 times, most recently from d4ae851 to 6183279 Compare November 14, 2024 14:03
@antiguru antiguru requested a review from def- November 14, 2024 14:13
@antiguru
Copy link
Member Author

Another nightly run: https://buildkite.com/materialize/nightly/builds/10390

@@ -775,7 +778,7 @@ def run(
return self.invoke(
"run",
*(["--entrypoint", entrypoint] if entrypoint else []),
*(f"-e{k}={v}" for k, v in env_extra.items()),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should consider printing env_extra to make it easier to debug. Including the values in the command spams the terminal, so I changed it to just pass the keys here, and pass the environment separately.

Comment on lines -123 to +126
impl Default for ClusterReplicaSizeMap {
// Used for testing and local purposes. This default value should not be used in production.
//
// Credits per hour are calculated as being equal to scale. This is not necessarily how the
// value is computed in production.
fn default() -> Self {
/// Used for testing and local purposes. This default value should not be used in production.
///
/// Credits per hour are calculated as being equal to scale. This is not necessarily how the
/// value is computed in production.
pub fn for_tests() -> Self {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is load-bearing because we remove the default implementation. Code that depends on a value need to use for_tests, which should make it sufficiently clear that this value isn't to be used in many situations.

Comment on lines +258 to +262
def bootstrap_cluster_replica_size() -> str:
return "bootstrap"
Copy link
Member Author

@antiguru antiguru Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using the size bootstrap for all replicas created with the bootstrap size. It's defined to be equivalent to the size 1. This allows us to detect situations when the builtin replica size map is used instead, because it doesn't define the bootstrap replica size.

@antiguru
Copy link
Member Author

This probably breaks the emulator! I'll have a change for that in a jiffy.

Copy link
Contributor

@def- def- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test part lgtm, nightly will need a bunch of cloudtest fixes too. See misc/python/materialize/cloudtest/k8s/environmentd.py

@antiguru
Copy link
Member Author

I'm out for the rest of they day, so if you @def- want to take over, I'd be happy!

@antiguru antiguru force-pushed the test_cluster_replica_sizes branch 4 times, most recently from b9969b1 to ef1cfd9 Compare November 18, 2024 09:44
Copy link
Contributor

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

antiguru and others added 8 commits November 18, 2024 18:02
Signed-off-by: Moritz Hoffmann <[email protected]>

fixup

Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
Signed-off-by: Moritz Hoffmann <[email protected]>
@def- def- force-pushed the test_cluster_replica_sizes branch from ef1cfd9 to de02135 Compare November 18, 2024 18:02
@def-
Copy link
Contributor

def- commented Nov 18, 2024

I have fixed up the remaining cloudtest failure.

@def- def- merged commit fc069a4 into MaterializeInc:main Nov 18, 2024
213 of 223 checks passed
@antiguru antiguru deleted the test_cluster_replica_sizes branch November 18, 2024 19:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants