From 0fd9bcdeda1a81bfb60811885e16fac49549dfb5 Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Mon, 11 Aug 2025 09:44:59 +0200 Subject: [PATCH 01/16] wip --- docs/requirements/process_overview.rst | 21 +++- docs/requirements/requirements.rst | 16 +++ integration.md | 165 +++++++++++++++++++++++++ 3 files changed, 200 insertions(+), 2 deletions(-) create mode 100644 integration.md diff --git a/docs/requirements/process_overview.rst b/docs/requirements/process_overview.rst index 64e120eb..4b660a2e 100644 --- a/docs/requirements/process_overview.rst +++ b/docs/requirements/process_overview.rst @@ -5,6 +5,23 @@ Process Requirements Overview =============================== .. needtable:: - :types: tool_req - :columns: satisfies as "Process Requirement" ;id as "Tool Requirement";implemented;source_code_link + :columns: satisfies_back as "Tool Requirement"; id as "Process Requirement";tags :style: table + + r = {} + + all = {} + for need in needs: + all[need['id']] = need + + for need in needs: + if any(tag in need['tags'] for tag in ["done_automation", "prio_1_automation", "prio_2_automation", "prio_3_automation"]): + if not "change_management" in need['tags']: + # Filter out change management related requirements + r[need['id']] = need + + for tool_req in needs.filter_types(['tool_req']): + for process_req_id in tool_req['satisfies']: + r[process_req_id] = all[process_req_id] + + results = r.values() diff --git a/docs/requirements/requirements.rst b/docs/requirements/requirements.rst index 942943af..b9c97db8 100644 --- a/docs/requirements/requirements.rst +++ b/docs/requirements/requirements.rst @@ -532,6 +532,9 @@ Architecture Attributes PROCESS_gd_req__req__linkage_fulfill :parent_covered: YES + .. note:: + TODO: link targets not clear + Docs-as-Code shall enforce that linking via the ``fulfils`` attribute follows defined rules. 
Allowed source and target combinations are defined in the following table:
@@ -980,3 +983,16 @@ Overview of Tool to Process Requirements
 .. needextend:: c.this_doc() and type == 'tool_req' and not status
    :status: valid
+
+.. tool_req:: Metamodel
+   :id: tool_req__docs_metamodel
+   :tags: metamodel
+   :implemented: YES
+
+   Docs-as-Code shall provide a metamodel for defining config in a `metamodel.yaml` in the source code repository.
+
+   .. note:: "satisfied by" is something like "used by" or "required by".
+
+.. needextend:: "metamodel.yaml" in source_code_link
+   :+satisfies: tool_req__docs_metamodel
+   :+tags: config
diff --git a/integration.md b/integration.md
new file mode 100644
index 00000000..7088c3e0
--- /dev/null
+++ b/integration.md
@@ -0,0 +1,165 @@
+# Integration Testing Workflows in Distributed Monoliths
+
+## Introduction
+
+Maintaining rapid pull‑request feedback while protecting the integrity of a distributed, component‑based system is a persistent challenge. As changes flow into the main branch, the risk of integration failures grows—especially when components are developed in separate repositories but must behave as a single system in production.
+
+This article explores a pragmatic approach to integration testing for systems characterized by:
+
+- Pull request–driven workflows
+- Interdependent components maintained in separate repositories (a so-called “distributed monolith”)
+- The need for end-to-end tests spanning multiple components
+- A focus on developer productivity without sacrificing integration confidence
+- This article does not require, but it does support operation in regulated domains such as automotive, industrial, or embedded systems
+
+The discussion centers on orchestrating integration tests across pull request pipelines, post-merge validations, and periodic full-system tests, with an emphasis on traceability, reproducibility, and sustained confidence.
+ +--- + +## Prerequisites + +This article assumes familiarity with modern software engineering practices, particularly: + +- CI/CD principles +- Git‑based workflows (feature branches, pull requests, rebases, merges) + +For foundational material, see: + +- [Modern Software Engineering – David Farley](https://www.oreilly.com/library/view/modern-software-engineering/9780137314942/) +- [Continuous Delivery – Jez Humble & David Farley](https://www.oreilly.com/library/view/continuous-delivery-reliable/9780321670250/) +- [The DevOps Handbook – Gene Kim, Patrick Debois, John Willis, and Jez Humble](https://www.oreilly.com/library/view/the-devops-handbook/9781098182281/) +- [Continuous Integration – Martin Fowler](https://martinfowler.com/articles/continuousIntegration.html) +- [The Continuous Delivery YouTube Channel](https://www.youtube.com/c/ContinuousDelivery) +- [Trunk-Based Development](https://trunkbaseddevelopment.com/) +- [Applying Trunk-Based Development in Large Teams](https://www.linkedin.com/blog/engineering/optimization/continuous-integration) +- [End-to-End Testing for Microservices – Bunnyshell](https://www.bunnyshell.com/blog/end-to-end-testing-for-microservices-a-2025-guide) +- [Scaling Distributed CI/CD Pipelines – ResearchGate](https://www.researchgate.net/publication/390157984_Scaling_Distributed_CICD_Pipelines_for_High-Throughput_Engineering_Teams_Architecture_Optimization_and_Developer_Experience) + +--- + +## Glossary + +- **Component**: A self-contained unit of code, typically a library or binary, that integrates with other components. +- **Component Test**: Tests a single component in isolation. +- **Integration Test**: Validates interactions between multiple components or subsystems. +- **Fast Tests**: Tests designed to execute in under ten minutes, providing rapid feedback. 
+ +--- + +## Scope + +This article does not address: + +- The rationale for pull‑request–based workflows or distributed monoliths +- Specific CI/CD tooling +- Container orchestration or service mesh patterns +- Regulatory frameworks or compliance processes +- General testing theory + +--- + +## The Challenge of Integration + +Distributed monoliths look like microservices on paper—many repositories, many builds—but behave like a single system in practice. Components share APIs, schemas, and timing assumptions. They often ship together. A small change in one place can ripple across the rest. + +Standard PR pipelines validate the piece you touched but often miss the system you implicitly changed. When components are tested in isolation, the first realistic system behavior appears post‑merge—after a change meets everyone else’s. That’s late and expensive feedback. + +Component-level testing typically doe not include contract testing, with the implicit assumption that downstream integration tests will catch any issues. This undermines the fast feedback loops essential for effective development. Moreover, standard Git-based workflows validate only the changed component in isolation, not the integrated system. Coordinating changes across repositories is non-trivial, and integration failures often surface post-merge, when remediation is more disruptive. + +1. End‑to‑end tests are slow and costly. Provisioning a realistic environment, compiling a build matrix, or coordinating hardware‑in‑the‑loop can push runtimes beyond what’s practical on every PR. +2. Cross‑repository changes are common. Interface tweaks, coordinated refactorings, or schema migrations need to move in lock‑step—even though Git’s default workflows don’t know that. + +We need to bring system‑level validation forward without imposing heavy costs on every PR, and to coordinate multi‑repo changes as first‑class citizens—within a PR‑gated workflow. 
+ +--- + +## Goals and Architectural Approach + +(That's somehow slightly redundant with the introduction?) + +To address these challenges, an effective integration workflow should: + +- Provide early, actionable feedback at the component level +- Reliably and reproducibly test cross‑component integration +- Balance test cost with coverage depth +- Scale with pull‑request–driven workflows +- Maintain traceability and visibility into what was tested, when, and why + +A central integration repository serves as the orchestrator for system‑wide builds and tests. This repository: + +- Defines the set of components to be integrated +- Acts as the source of truth for integration test configurations +- Triggers integration tests using explicit version combinations +- Serves as a gatekeeper, allowing only validated component versions into production‑bound builds + +By decoupling application logic from test orchestration, this architecture enables: + +- Separation of concerns: the integration repository contains no application code, focusing solely on orchestration +- Efficient CI pipeline design: component pipelines are distinct from cross‑component integration pipelines, reducing redundant CI overhead +- Consistent governance: updates must pass defined quality checks before acceptance, preserving system integrity without impeding local agility +- Independent component repositories: each component evolves in its own repository, with isolated development and CI +- Minimal overhead: component repositories remain lightweight, free from unnecessary shared tooling +- Improved troubleshooting: failures can be isolated to individual components or integration logic, expediting root cause analysis + +--- + +## Integration Workflows + +### Pre‑Merge Testing (Pull Requests) + +When a PR is opened or updated in a component repository, two parallel workflows are triggered: + +- Fast, component‑specific tests (unit and component‑level integration) run in the component’s CI pipeline. 
+- A system‑level integration workflow in the integration repository validates compatibility with the rest of the system, typically running a fast subset of the integration test suite. + +The integration repository fetches the PR branch from the component under test and combines it with the latest main branches (or last known‑good versions) of other components to form a synthetic system configuration. This configuration is then built and tested. The workflow may run in parallel with component CI (favoring rapid feedback) or sequentially (minimizing CI load), depending on project constraints. + +--- + +### Pre‑Merge Testing of Cross‑Repository Dependent Changes + +When changes in one component necessitate coordinated updates in others, the integration repository enables testing these combinations together. Related PRs across repositories are grouped, and the integration repository constructs a configuration using the relevant branches. Run the same fast subset as for single‑PR validation and report a unified status back to each PR. + +Two conventions help: + +- Group related PRs via metadata (titles, labels, or an explicit manifest) so the integration repo can discover them +- Resolve branch selection deterministically (e.g., PR branch overrides main for listed components; others stick to last known‑good) + +This turns ad‑hoc coordination into a normal operation. It reduces the risk that “the last repo to merge” breaks the system because you tested the change set as a unit before anything merges. + +--- + +### Post‑Merge Integration Validation + +After a PR is merged, the integration repository runs a fuller integration suite using the updated state. Some teams run this on every main‑branch commit; others batch changes and run on a timer. Whatever the cadence, the goal is to run a deeper suite than the pre‑merge subset and to record the exact component versions that passed. + +Two common patterns: + +- Always‑on verification: run after every merge. 
Failures are easy to attribute but costs are higher. +- Scheduled verification: run on a timer. Costs are lower; root cause analysis is harder. Pair this with bisect automation to identify the offending change when failures occur. + +Successful post‑merge tests confirm system stability, and the exact version tuple is recorded for future reference. This decouples verification from release, allowing components and the integrated system to be released independently as needed. + +--- + +### Conclusion + +Integrating distributed, component‑based systems in a PR‑driven workflow demands disciplined orchestration. Keep most checks close to the code. Use a central integration repository to assemble realistic compositions, run a fast subset pre‑merge, and verify deeply post‑merge. Record exactly what passed. Treat coordinated changes as first‑class. Over time, you’ll get what you need: quick PR feedback and confidence that the system still works when parts move. + +--- + +## Considered Alternatives + +### SemVer per Component + +Each component could adopt Semantic Versioning (SemVer) independently, allowing for more granular control over versioning and dependencies. This approach would enable teams to release updates at their own pace while still providing a clear framework for compatibility. + +While this sounds great, it has repeatedly failed in practice due to the complexities of managing interdependencies and ensuring compatibility across components. + +TODO: more on why it failed. Need to interview people. 
+ +--- + +## Implementation Details + +TODO From 192f25a2ccb905975bdc5900652276d05827dcc6 Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 13:54:06 +0200 Subject: [PATCH 02/16] wip --- integration.md | 38 ++++++++++---------------------------- 1 file changed, 10 insertions(+), 28 deletions(-) diff --git a/integration.md b/integration.md index 7088c3e0..93610e0f 100644 --- a/integration.md +++ b/integration.md @@ -2,17 +2,9 @@ ## Introduction -Maintaining rapid pull‑request feedback while protecting the integrity of a distributed, component‑based system is a persistent challenge. As changes flow into the main branch, the risk of integration failures grows—especially when components are developed in separate repositories but must behave as a single system in production. +This article assumes you already: (1) develop via pull requests with required checks; (2) work across multiple interdependent repositories (a distributed monolith); and (3) have a central integration repository that orchestrates cross‑component builds and tests. We treat those as prerequisites—not topics to justify. -This article explores a pragmatic approach to integration testing for systems characterized by: - -- Pull request–driven workflows -- Interdependent components maintained in separate repositories (a so-called “distributed monolith”) -- The need for end-to-end tests spanning multiple components -- A focus on developer productivity without sacrificing integration confidence -- This article does not require, but it does support operation in regulated domains such as automotive, industrial, or embedded systems - -The discussion centers on orchestrating integration tests across pull request pipelines, post-merge validations, and periodic full-system tests, with an emphasis on traceability, reproducibility, and sustained confidence. 
+The focus is on tightening workflows: fast pre‑merge signals, coordinated multi‑repo change handling, and post‑merge validation that produces auditable, reproducible version tuples. We skip foundational explanations and concentrate on practice. --- @@ -75,9 +67,7 @@ We need to bring system‑level validation forward without imposing heavy costs ## Goals and Architectural Approach -(That's somehow slightly redundant with the introduction?) - -To address these challenges, an effective integration workflow should: +We focus on optimizing the existing setup. Effective integration workflows should: - Provide early, actionable feedback at the component level - Reliably and reproducibly test cross‑component integration @@ -85,14 +75,14 @@ To address these challenges, an effective integration workflow should: - Scale with pull‑request–driven workflows - Maintain traceability and visibility into what was tested, when, and why -A central integration repository serves as the orchestrator for system‑wide builds and tests. 
This repository:
-- Defines the set of components to be integrated
-- Acts as the source of truth for integration test configurations
-- Triggers integration tests using explicit version combinations
-- Serves as a gatekeeper, allowing only validated component versions into production‑bound builds
+A central integration repository (assumed present) handles:
+- Defining participating components
+- Holding integration test configuration
+- Triggering tests for explicit version combinations
+- Recording/approving validated sets for downstream use
-By decoupling application logic from test orchestration, this architecture enables:
+Benefits (realized when disciplined) include:
 - Separation of concerns: the integration repository contains no application code, focusing solely on orchestration
 - Efficient CI pipeline design: component pipelines are distinct from cross‑component integration pipelines, reducing redundant CI overhead
 - Consistent governance: updates must pass defined quality checks before acceptance, preserving system integrity without impeding local agility
 - Independent component repositories: each component evolves in its own repository, with isolated development and CI
 - Minimal overhead: component repositories remain lightweight, free from unnecessary shared tooling
 - Improved troubleshooting: failures can be isolated to individual components or integration logic, expediting root cause analysis
@@ -152,14 +142,6 @@ Integrating distributed, component‑based systems in a PR‑driven workflow dem
 ### SemVer per Component
-Each component could adopt Semantic Versioning (SemVer) independently, allowing for more granular control over versioning and dependencies. This approach would enable teams to release updates at their own pace while still providing a clear framework for compatibility.
-
-While this sounds great, it has repeatedly failed in practice due to the complexities of managing interdependencies and ensuring compatibility across components.
-
-TODO: more on why it failed. Need to interview people.
+Each component could adopt Semantic Versioning (SemVer) independently, allowing for more granular control over versioning and dependencies. However, in the end we want to verify main branches, not tagged commits; tagging every commit with a version number would be a rather silly replacement for Git hashes.
--- - -## Implementation Details - -TODO From e46f6cc2fdba86f07f35bbf8821f8c20d835054d Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 14:01:16 +0200 Subject: [PATCH 03/16] wip --- integration.md | 74 ++++++++++++++++++++++++++++++++++---------------- 1 file changed, 50 insertions(+), 24 deletions(-) diff --git a/integration.md b/integration.md index 93610e0f..fcb7bf94 100644 --- a/integration.md +++ b/integration.md @@ -2,9 +2,9 @@ ## Introduction -This article assumes you already: (1) develop via pull requests with required checks; (2) work across multiple interdependent repositories (a distributed monolith); and (3) have a central integration repository that orchestrates cross‑component builds and tests. We treat those as prerequisites—not topics to justify. +This article assumes you already: (1) develop via pull requests with required checks; (2) work across multiple interdependent repositories (a distributed monolith); and (3) have a central integration repository that orchestrates cross-component builds and tests. We treat those as prerequisites—not topics to justify. -The focus is on tightening workflows: fast pre‑merge signals, coordinated multi‑repo change handling, and post‑merge validation that produces auditable, reproducible version tuples. We skip foundational explanations and concentrate on practice. +The focus is on tightening workflows: fast pre-merge signals, coordinated multi-repo change handling, and post-merge validation that produces auditable, reproducible version tuples. We skip foundational explanations and concentrate on practice. 
--- @@ -13,7 +13,7 @@ The focus is on tightening workflows: fast pre‑merge signals, coordinated mult This article assumes familiarity with modern software engineering practices, particularly: - CI/CD principles -- Git‑based workflows (feature branches, pull requests, rebases, merges) +- Git-based workflows (feature branches, pull requests, rebases, merges) For foundational material, see: @@ -42,7 +42,7 @@ For foundational material, see: This article does not address: -- The rationale for pull‑request–based workflows or distributed monoliths +- The rationale for pull-request–based workflows or distributed monoliths - Specific CI/CD tooling - Container orchestration or service mesh patterns - Regulatory frameworks or compliance processes @@ -54,14 +54,14 @@ This article does not address: Distributed monoliths look like microservices on paper—many repositories, many builds—but behave like a single system in practice. Components share APIs, schemas, and timing assumptions. They often ship together. A small change in one place can ripple across the rest. -Standard PR pipelines validate the piece you touched but often miss the system you implicitly changed. When components are tested in isolation, the first realistic system behavior appears post‑merge—after a change meets everyone else’s. That’s late and expensive feedback. +Standard PR pipelines validate the piece you touched but often miss the system you implicitly changed. When components are tested in isolation, the first realistic system behavior appears post-merge—after a change meets everyone else’s. That’s late and expensive feedback. Component-level testing typically doe not include contract testing, with the implicit assumption that downstream integration tests will catch any issues. This undermines the fast feedback loops essential for effective development. Moreover, standard Git-based workflows validate only the changed component in isolation, not the integrated system. 
Coordinating changes across repositories is non-trivial, and integration failures often surface post-merge, when remediation is more disruptive. -1. End‑to‑end tests are slow and costly. Provisioning a realistic environment, compiling a build matrix, or coordinating hardware‑in‑the‑loop can push runtimes beyond what’s practical on every PR. -2. Cross‑repository changes are common. Interface tweaks, coordinated refactorings, or schema migrations need to move in lock‑step—even though Git’s default workflows don’t know that. +1. End-to-end tests are slow and costly. Provisioning a realistic environment, compiling a build matrix, or coordinating hardware-in-the-loop can push runtimes beyond what’s practical on every PR. +2. Cross-repository changes are common. Interface tweaks, coordinated refactorings, or schema migrations need to move in lock-step—even though Git’s default workflows don’t know that. -We need to bring system‑level validation forward without imposing heavy costs on every PR, and to coordinate multi‑repo changes as first‑class citizens—within a PR‑gated workflow. +We need to bring system-level validation forward without imposing heavy costs on every PR, and to coordinate multi-repo changes as first-class citizens—within a PR-gated workflow. --- @@ -70,9 +70,9 @@ We need to bring system‑level validation forward without imposing heavy costs We focus on optimizing the existing setup. 
Effective integration workflows should: - Provide early, actionable feedback at the component level -- Reliably and reproducibly test cross‑component integration +- Reliably and reproducibly test cross-component integration - Balance test cost with coverage depth -- Scale with pull‑request–driven workflows +- Scale with pull-request–driven workflows - Maintain traceability and visibility into what was tested, when, and why A central integration repository (assumed present) handles: @@ -85,7 +85,7 @@ A central integration repository (assumed present) handles: Benefits (realized when disciplined) include: - Separation of concerns: the integration repository contains no application code, focusing solely on orchestration -- Efficient CI pipeline design: component pipelines are distinct from cross‑component integration pipelines, reducing redundant CI overhead +- Efficient CI pipeline design: component pipelines are distinct from cross-component integration pipelines, reducing redundant CI overhead - Consistent governance: updates must pass defined quality checks before acceptance, preserving system integrity without impeding local agility - Independent component repositories: each component evolves in its own repository, with isolated development and CI - Minimal overhead: component repositories remain lightweight, free from unnecessary shared tooling @@ -95,46 +95,48 @@ Benefits (realized when disciplined) include: ## Integration Workflows -### Pre‑Merge Testing (Pull Requests) +### Pre-Merge Testing (Pull Requests) When a PR is opened or updated in a component repository, two parallel workflows are triggered: -- Fast, component‑specific tests (unit and component‑level integration) run in the component’s CI pipeline. -- A system‑level integration workflow in the integration repository validates compatibility with the rest of the system, typically running a fast subset of the integration test suite. 
+- Fast, component-specific tests (unit and component-level integration) run in the component’s CI pipeline. +- A system-level integration workflow in the integration repository validates compatibility with the rest of the system, typically running a fast subset of the integration test suite. -The integration repository fetches the PR branch from the component under test and combines it with the latest main branches (or last known‑good versions) of other components to form a synthetic system configuration. This configuration is then built and tested. The workflow may run in parallel with component CI (favoring rapid feedback) or sequentially (minimizing CI load), depending on project constraints. +The integration repository fetches the PR branch from the component under test and combines it with the latest main branches (or last known-good versions) of other components to form a synthetic system configuration. This configuration is then built and tested. The workflow may run in parallel with component CI (favoring rapid feedback) or sequentially (minimizing CI load), depending on project constraints. --- -### Pre‑Merge Testing of Cross‑Repository Dependent Changes +### Pre-Merge Testing of Cross-Repository Dependent Changes -When changes in one component necessitate coordinated updates in others, the integration repository enables testing these combinations together. Related PRs across repositories are grouped, and the integration repository constructs a configuration using the relevant branches. Run the same fast subset as for single‑PR validation and report a unified status back to each PR. +When changes in one component necessitate coordinated updates in others, the integration repository enables testing these combinations together. Related PRs across repositories are grouped, and the integration repository constructs a configuration using the relevant branches. Run the same fast subset as for single-PR validation and report a unified status back to each PR. 
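The branch-resolution step described above lends itself to a small, deterministic function. The sketch below is illustrative only — the component names, ref formats, and dict-based manifest shape are assumptions, not an existing API:

```python
# Hypothetical sketch: build a synthetic system composition for one
# integration run. PR branches override the components under test;
# every other component stays pinned to its last known-good ref.

def resolve_composition(known_good: dict[str, str],
                        pr_overrides: dict[str, str]) -> dict[str, str]:
    """Return a component -> git ref mapping for one integration run."""
    unknown = set(pr_overrides) - set(known_good)
    if unknown:
        # Refuse to silently drop typo'd or unregistered component names.
        raise ValueError(f"unknown components in overrides: {sorted(unknown)}")
    return {name: pr_overrides.get(name, ref) for name, ref in known_good.items()}

if __name__ == "__main__":
    known_good = {"comms": "a1b2c3d", "storage": "9f8e7d6", "ui": "0cafe42"}
    print(resolve_composition(known_good, {"comms": "pr-1234-head-sha"}))
```

The same function covers both the single-PR case (one override) and grouped cross-repository changes (several overrides), which keeps the two pre-merge workflows behaviorally identical.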
Two conventions help: - Group related PRs via metadata (titles, labels, or an explicit manifest) so the integration repo can discover them -- Resolve branch selection deterministically (e.g., PR branch overrides main for listed components; others stick to last known‑good) +- Resolve branch selection deterministically (e.g., PR branch overrides main for listed components; others stick to last known-good) -This turns ad‑hoc coordination into a normal operation. It reduces the risk that “the last repo to merge” breaks the system because you tested the change set as a unit before anything merges. +This turns ad-hoc coordination into a normal operation. It reduces the risk that “the last repo to merge” breaks the system because you tested the change set as a unit before anything merges. --- -### Post‑Merge Integration Validation +### Post-Merge Integration Validation -After a PR is merged, the integration repository runs a fuller integration suite using the updated state. Some teams run this on every main‑branch commit; others batch changes and run on a timer. Whatever the cadence, the goal is to run a deeper suite than the pre‑merge subset and to record the exact component versions that passed. +After a PR is merged, the integration repository runs a fuller integration suite using the updated state. Some teams run this on every main-branch commit; others batch changes and run on a timer. Whatever the cadence, the goal is to run a deeper suite than the pre-merge subset and to record the exact component versions that passed. Two common patterns: -- Always‑on verification: run after every merge. Failures are easy to attribute but costs are higher. +- Always-on verification: run after every merge. Failures are easy to attribute but costs are higher. - Scheduled verification: run on a timer. Costs are lower; root cause analysis is harder. Pair this with bisect automation to identify the offending change when failures occur. 
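For the scheduled-verification pattern, the bisect automation mentioned above is an ordinary binary search over the ordered list of main-branch states. A minimal, hypothetical sketch — the `is_good` callback stands in for actually running the integration suite:

```python
# Sketch: bisect an ordered history of integration states (e.g., merge
# commits since the last scheduled run) to find the earliest failure.
from typing import Callable, Sequence

def first_bad(states: Sequence[str], is_good: Callable[[str], bool]) -> int:
    """Index of the first failing state.

    Assumes states[0] is good, states[-1] is bad, and breakage is
    monotonic (once broken, it stays broken within this window).
    """
    lo, hi = 0, len(states) - 1  # invariant: states[lo] good, states[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(states[mid]):
            lo = mid
        else:
            hi = mid
    return hi
```

This needs only O(log n) suite runs per failure window, which is what makes the cheaper scheduled cadence tolerable.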
-Successful post‑merge tests confirm system stability, and the exact version tuple is recorded for future reference. This decouples verification from release, allowing components and the integrated system to be released independently as needed. +Successful post-merge tests confirm system stability, and the exact version tuple is recorded for future reference. This decouples verification from release, allowing components and the integrated system to be released independently as needed. --- ### Conclusion -Integrating distributed, component‑based systems in a PR‑driven workflow demands disciplined orchestration. Keep most checks close to the code. Use a central integration repository to assemble realistic compositions, run a fast subset pre‑merge, and verify deeply post‑merge. Record exactly what passed. Treat coordinated changes as first‑class. Over time, you’ll get what you need: quick PR feedback and confidence that the system still works when parts move. +Integrating distributed, component-based systems in a PR-driven workflow demands disciplined orchestration. Keep most checks close to the code. Use a central integration repository to assemble realistic compositions, run a fast subset pre-merge, and verify deeply post-merge. Record exactly what passed. Treat coordinated changes as first-class. Over time, you’ll get what you need: quick PR feedback and confidence that the system still works when parts move. + +Releases can happen independently of integration, on any verified commit on the main branch. --- @@ -145,3 +147,27 @@ Integrating distributed, component‑based systems in a PR‑driven workflow dem Each component could adopt Semantic Versioning (SemVer) independently, allowing for more granular control over versioning and dependencies. However in the end we want to verify main branches, and not tagged commits. Tagging every commit with a version number would be a rather silly replacement of git hashes. 
--- + +## Realization in GitHub & Bazel (Planned) +*With concrete examples in S-CORE* + +### Pre-Merge Testing (Pull Requests) + +#### When concrete consumers are known, the integration test can be performed manually. +This mode is especially useful for tooling repositories. + +A local workflow checks out known consumers and injects the local PR branch via bazels git_override function. + +*We currently do that within docs-as-code consumer-tests*. + +#### Otherwise call integration repository workflows via workflow dispatch. + +todo + +### Pre-Merge Testing of Cross-Repository Dependent Changes + +todo; With github labels?! + +### Post-Merge Integration Validation + +tbd From 51172671e0c81f5882ce9a7b6b58484ab835d664 Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 16:41:39 +0200 Subject: [PATCH 04/16] github v1 --- integration.md | 216 ++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 204 insertions(+), 12 deletions(-) diff --git a/integration.md b/integration.md index fcb7bf94..465917d1 100644 --- a/integration.md +++ b/integration.md @@ -56,7 +56,12 @@ Distributed monoliths look like microservices on paper—many repositories, many Standard PR pipelines validate the piece you touched but often miss the system you implicitly changed. When components are tested in isolation, the first realistic system behavior appears post-merge—after a change meets everyone else’s. That’s late and expensive feedback. -Component-level testing typically doe not include contract testing, with the implicit assumption that downstream integration tests will catch any issues. This undermines the fast feedback loops essential for effective development. Moreover, standard Git-based workflows validate only the changed component in isolation, not the integrated system. Coordinating changes across repositories is non-trivial, and integration failures often surface post-merge, when remediation is more disruptive. 
+Component-level testing typically does not include contract testing, with the implicit +assumption that downstream integration tests will catch any issues. This undermines the +fast feedback loops essential for effective development. Moreover, standard Git-based +workflows validate only the changed component in isolation, not the integrated system. +Coordinating changes across repositories is non-trivial, and integration failures often +surface post-merge, when remediation is more disruptive. 1. End-to-end tests are slow and costly. Provisioning a realistic environment, compiling a build matrix, or coordinating hardware-in-the-loop can push runtimes beyond what’s practical on every PR. 2. Cross-repository changes are common. Interface tweaks, coordinated refactorings, or schema migrations need to move in lock-step—even though Git’s default workflows don’t know that. @@ -148,26 +153,213 @@ Each component could adopt Semantic Versioning (SemVer) independently, allowing --- -## Realization in GitHub & Bazel (Planned) -*With concrete examples in S-CORE* +## Realization in GitHub (Planned) -### Pre-Merge Testing (Pull Requests) +How to implement the above patterns on GitHub. +(According to our current knowledge. We have not done so yet.) + +*Examples use Bazel (S-CORE), but the workflow patterns are tool-agnostic.* -#### When concrete consumers are known, the integration test can be performed manually. -This mode is especially useful for tooling repositories. +--- -A local workflow checks out known consumers and injects the local PR branch via bazels git_override function. +### Pre-Merge Testing (Pull Requests) -*We currently do that within docs-as-code consumer-tests*. +Two modes: (A) local consumer tests (provider-driven), (B) integration tests (integration-driven). + +#### A. Consumer Injection +Use when a repo has a well-known set of representative consumers. + +1. Clone the consumers repository. +2. 
Replace the dependency to the current repository with a dependency to the PR version (*for bazel that's appending `git_override` to `MODULE.bazel`*). +3. Run the relevant consumer target to verify this "small-scoped-integration". + + + +``` +git_override( + module_name = "module_under_test", + remote = "{gh_url}", + commit = "{git_pr_hash}" +) +``` + +#### B. Automated integration workflow +A pull_request in component repos triggers a `repository_dispatch` / `workflow_call` to the integration repo. + +High-level GitHub Actions outline (component repo side): +``` +name: integration-pr +on: [pull_request] +jobs: + dispatch: + runs-on: ubuntu-latest + steps: + - name: Dispatch to integration repo + uses: peter-evans/repository-dispatch@v3 + with: + token: ${{ secrets.INTEGRATION_TRIGGER_TOKEN }} + repository: eclipse-score/reference_integration + event-type: pr-integration + client-payload: >- + {"repo":"${{ github.repository }}", + "pr": "${{ github.event.pull_request.number }}", + "sha":"${{ github.sha }}"} +``` + +Integration repo receiving workflow (simplified): +``` +on: + repository_dispatch: + types: [pr-integration] +jobs: + pr-fast-subset: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Parse payload + run: | + echo '${{ toJson(github.event.client_payload) }}' > payload.json + - name: Materialize composition + run: python scripts/gen_pr_manifest.py payload.json manifest.pr.yaml + - name: Fetch component under test + run: python scripts/fetch_component.py manifest.pr.yaml # clones repo@PR SHA + - name: Render MODULE overrides + run: python scripts/render_overrides.py manifest.pr.yaml MODULE.override.bzl + - name: Bazel test (subset) + run: bazel test //integration/subset:pr_fast --override_module_files=MODULE.override.bzl + - name: Store manifest & results + uses: actions/upload-artifact@v4 + with: + name: pr-subset-${{ github.run_id }} + path: | + manifest.pr.yaml + bazel-testlogs/**/test.log +``` + +Manifest (example) written by 
`gen_pr_manifest.py`: +``` +pr: 482 +component_under_test: + name: docs-as-code + repo: eclipse-score/docs-as-code + sha: 6bc901f2 +others: + - name: component-a + repo: eclipse-score/component-a + ref: main + - name: component-b + repo: eclipse-score/component-b + ref: main +subset: pr_fast +timestamp: 2025-08-13T12:14:03Z +``` -#### Otherwise call integration repository workflows via workflow dispatch. -todo +--- ### Pre-Merge Testing of Cross-Repository Dependent Changes -todo; With github labels?! +Coordination mechanism: a changeset label (e.g. `changeset:feature-x`) applied to each involved PR. + +Automated discovery (label mode): integration workflow queries GitHub search API for open PRs with the same `changeset:` label across allowed repositories, then builds a manifest analogous to the single-PR manifest but with multiple `overrides` entries. + +Declarative manifest example (`changesets/feature-x`): +``` +components: + - name: users-service + repo: eclipse-score/users-service + branch: feature/new_email_index + pr: 16 + - name: auth-service + repo: eclipse-score/auth-service + branch: feature/lenient-token-parser + pr: 150 +others: + - name: billing-service + repo: eclipse-score/billing-service + ref: last_stable +subset: pr_fast +changeset: feature-x +``` + +Workflow differences vs single PR: +- Replace multiple dependencies +- Post unified status back to each PR (via a bot comment or commit status) summarizing subset result and manifest hash. + +Status semantics: all involved PRs blocked until this coordinated subset passes. + +--- ### Post-Merge Integration Validation -tbd +Trigger: push to `main` in any component repo OR scheduled (cron) in integration repo pulling latest heads. Two modes: +1. Per-merge: repository_dispatch from component merge workflow +2. 
Scheduled batch: hourly cron that refreshes each component repository SHA + +Workflow outline (full suite): +``` +on: + schedule: [{cron: "15 * * * *"}] + repository_dispatch: + types: [component-merged] +jobs: + full-suite: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Generate full manifest + run: python scripts/gen_full_manifest.py manifest.full.yaml + - name: Bazel test (full) + run: bazel test //integration/full:all --test_tag_filters=-flaky + - name: Persist known-good tuple (on success) + if: success() + run: python scripts/persist_tuple.py manifest.full.yaml known_good/index.json + - name: Upload artifacts + uses: actions/upload-artifact@v4 + with: + name: full-${{ github.run_id }} + path: | + manifest.full.yaml + known_good/index.json + bazel-testlogs/**/test.log +``` + +Persistence strategies: +- Commit updated `known_good/index.json` (requires a bot token) containing an array of tuples with timestamp + SHAs + manifest hash. +- Or publish a release/tag referencing the manifest artifact (immutable evidence). + +Known-good record snippet: +``` +[ + { + "timestamp": "2025-08-13T12:55:10Z", + "tuple": { + "docs-as-code": "6bc901f2", + "component-a": "91c0d4e1", + "component-b": "a44f0cd9" + }, + "manifest_sha256": "4c9b7f...", + "suite": "full", + "duration_s": 742 + } +] +``` + +On failure: attach failing manifest + summarized failing targets; optionally open (or update) a rolling issue keyed by manifest hash to avoid alert fatigue. + +--- + +### Considerations + +- Use caching to keep PR subset times predictable (bazel, ccache, etc.) +- Tag slow or flaky tests; exclude from `pr_fast`. +- Keep the subset target as an explicit target group (e.g. `pr_fast` alias) rather than relying on pattern globs—makes curation auditable via review. + +--- + +### Failure Triage Flow (Recommended) +1. PR subset fails: developer inspects manifest + specific seam test log; reproduce locally with `reproduce.sh manifest.pr.yaml`. +2. 
Coordinated set fails: manual investigation of all involved PRs and their logs. +3. Post-merge fails: bisect between last known‑good and current HEAD across components and component SHAs (scripted: iterate manifest permutations if necessary) then open focused issue. + +--- From 775913cb2eebf019d30d787b86dfde973fa91247 Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 17:12:39 +0200 Subject: [PATCH 05/16] style rewrite --- integration.md | 345 +++++++++++++------------------------------------ 1 file changed, 92 insertions(+), 253 deletions(-) diff --git a/integration.md b/integration.md index 465917d1..a3161b46 100644 --- a/integration.md +++ b/integration.md @@ -1,192 +1,92 @@ -# Integration Testing Workflows in Distributed Monoliths +# Integration Testing in a Distributed Monolith -## Introduction +Teams often split what is functionally a single system across many repositories. Each repository can show a green build while the assembled system is already broken. This article looks at how to bring system-level feedback earlier when you work that way. -This article assumes you already: (1) develop via pull requests with required checks; (2) work across multiple interdependent repositories (a distributed monolith); and (3) have a central integration repository that orchestrates cross-component builds and tests. We treat those as prerequisites—not topics to justify. +The context here assumes three things: you develop through pull requests with required checks; you have multiple interdependent repositories that ship together; and you either have or will create a central integration repository used only for orchestration. If any of those are absent you will need to establish them first; the rest of the discussion builds on them. -The focus is on tightening workflows: fast pre-merge signals, coordinated multi-repo change handling, and post-merge validation that produces auditable, reproducible version tuples. 
We skip foundational explanations and concentrate on practice. +## Where Problems Usually Appear +An interface change (for example a renamed field in a shared schema) is updated in two direct consumers. Their pull requests pass. Another consumer several repositories away still depends on the old interface and only fails once the whole set of changes reaches main and a later integration run executes. The defect was present early but only visible late. Investigation now needs cross-repo log hunting instead of a quick fix while the change was still in flight. ---- +Running full end-to-end environments on every pull request is rarely affordable. Coordinated multi-repository changes are then handled informally through ad-hoc ordering: “merge yours after mine”. Late detection raises cost and makes regression origins harder to locate. -## Prerequisites +## Core Ideas +We model the integrated system as an explicit set of (component, commit) pairs captured in a manifest. We derive those manifests deterministically from events: a single pull request, a coordinated group of pull requests, or a post-merge refresh. We run a curated fast subset of integration tests for pre-merge feedback and a deeper suite after merge. When a suite passes we record the manifest (“known good”). Coordinated multi-repository changes are treated as a first-class case so they are validated as a unit rather than through merge ordering. -This article assumes familiarity with modern software engineering practices, particularly: +## Terminology +Component – a repository that participates in the assembled product (for example a service API repo or a common library). +Fast subset – a curated group of integration tests chosen to finish in single-digit minutes; for example tests that exercise protocol seams or migration boundaries. +Tuple – the mapping of component names to their commit SHAs for one integrated build; e.g. { users: a1c3f9d, billing: 9e02b4c }. 
+Known good – a tuple plus metadata (timestamp, suite, manifest hash) that passed a defined suite and is stored for later reproduction. -- CI/CD principles -- Git-based workflows (feature branches, pull requests, rebases, merges) +## Out of Scope +This piece does not argue for pull requests, trunk-based development, or continuous integration itself. Those are well covered elsewhere. It also does not look into any specific tools or implementations for achieving these practices. -For foundational material, see: +## A Note on History +Classic continuous integration advice assumed a single codebase. Splitting a cohesive system across repositories reintroduces many of the coordination issues CI was meant to remove. The approach here adapts familiar CI principles (frequent integration, fast feedback, reproducibility) to a multi-repository boundary. -- [Modern Software Engineering – David Farley](https://www.oreilly.com/library/view/modern-software-engineering/9780137314942/) -- [Continuous Delivery – Jez Humble & David Farley](https://www.oreilly.com/library/view/continuous-delivery-reliable/9780321670250/) -- [The DevOps Handbook – Gene Kim, Patrick Debois, John Willis, and Jez Humble](https://www.oreilly.com/library/view/the-devops-handbook/9781098182281/) -- [Continuous Integration – Martin Fowler](https://martinfowler.com/articles/continuousIntegration.html) -- [The Continuous Delivery YouTube Channel](https://www.youtube.com/c/ContinuousDelivery) -- [Trunk-Based Development](https://trunkbaseddevelopment.com/) -- [Applying Trunk-Based Development in Large Teams](https://www.linkedin.com/blog/engineering/optimization/continuous-integration) -- [End-to-End Testing for Microservices – Bunnyshell](https://www.bunnyshell.com/blog/end-to-end-testing-for-microservices-a-2025-guide) -- [Scaling Distributed CI/CD Pipelines – 
ResearchGate](https://www.researchgate.net/publication/390157984_Scaling_Distributed_CICD_Pipelines_for_High-Throughput_Engineering_Teams_Architecture_Optimization_and_Developer_Experience) +## Why Use a Central Integration Repository +A central repository offers a neutral place to define which components participate, to build manifests from events, to hold integration‑specific helpers (overrides, fixtures, seam tests), and to persist records of successful tuples. It should not contain business code. Keeping it small keeps review focused and reduces accidental coupling. ---- +## Workflow Layers +We use three recurring workflows: a single pull request, a coordinated subset when multiple pull requests must land together, and a post‑merge fuller suite. Each produces a manifest, runs an appropriate depth of tests, and may record the tuple if successful. -## Glossary +### Single Pull Request +When a pull request opens or updates, its repository runs its normal fast tests. The integration repository is also triggered with the repository name, pull request number, and head SHA. It builds a manifest using that SHA for the changed component and the last known-good (or main) SHAs for others, then runs the curated fast subset. The result is reported back to the pull request. The manifest and logs are stored even when failing so a developer can reproduce locally. -- **Component**: A self-contained unit of code, typically a library or binary, that integrates with other components. -- **Component Test**: Tests a single component in isolation. -- **Integration Test**: Validates interactions between multiple components or subsystems. -- **Fast Tests**: Tests designed to execute in under ten minutes, providing rapid feedback. +The subset is explicit rather than dynamically inferred. Tests in it should fail quickly when contracts or shared schemas drift. If the list grows until it is slow it will either be disabled or ignored; regular curation keeps it useful. 
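
The manifest-building step in this single-pull-request flow can be sketched as follows. This is an illustration only: the `build_pr_manifest` helper and the component registry are assumptions (the field names follow the manifest examples later in this article), not an existing script.

```
# Sketch: build a single-PR manifest from a dispatch event.
# The component registry is invented for illustration; field names
# mirror the manifest examples shown later in this article.
from datetime import datetime, timezone

# Hypothetical registry of participating components: repo -> short name.
KNOWN_COMPONENTS = {
    "eclipse-score/docs-as-code": "docs-as-code",
    "eclipse-score/component-a": "component-a",
    "eclipse-score/component-b": "component-b",
}

def build_pr_manifest(repo: str, pr: int, sha: str) -> dict:
    """Pin the changed component to its PR head SHA; all others ride on main."""
    return {
        "pr": pr,
        "component_under_test": {
            "name": KNOWN_COMPONENTS[repo],
            "repo": repo,
            "sha": sha,
        },
        "others": [
            {"name": name, "repo": r, "ref": "main"}
            for r, name in KNOWN_COMPONENTS.items()
            if r != repo
        ],
        "subset": "pr_fast",
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }

manifest = build_pr_manifest("eclipse-score/docs-as-code", 482, "6bc901f2")
print(manifest["component_under_test"]["sha"])
```

The essential property is determinism: given the same event, the same manifest is produced, so a failing run can be reproduced exactly.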
---- +### Coordinated Multi-Repository Subset +Some changes require multiple repositories to move together (for example a schema evolution, a cross-cutting refactor, a protocol tightening). We mark related pull requests using a stable mechanism such as a common label (e.g. changeset:feature-x). The integration workflow discovers all open pull requests sharing the label, builds a manifest from their head SHAs, and runs the same fast subset. A unified status is posted back to each pull request. None merge until the coordinated set is green. This removes informal merge ordering as a coordination mechanism. -## Scope - -This article does not address: - -- The rationale for pull-request–based workflows or distributed monoliths -- Specific CI/CD tooling -- Container orchestration or service mesh patterns -- Regulatory frameworks or compliance processes -- General testing theory - ---- - -## The Challenge of Integration - -Distributed monoliths look like microservices on paper—many repositories, many builds—but behave like a single system in practice. Components share APIs, schemas, and timing assumptions. They often ship together. A small change in one place can ripple across the rest. - -Standard PR pipelines validate the piece you touched but often miss the system you implicitly changed. When components are tested in isolation, the first realistic system behavior appears post-merge—after a change meets everyone else’s. That’s late and expensive feedback. - -Component-level testing typically does not include contract testing, with the implicit -assumption that downstream integration tests will catch any issues. This undermines the -fast feedback loops essential for effective development. Moreover, standard Git-based -workflows validate only the changed component in isolation, not the integrated system. -Coordinating changes across repositories is non-trivial, and integration failures often -surface post-merge, when remediation is more disruptive. - -1. 
End-to-end tests are slow and costly. Provisioning a realistic environment, compiling a build matrix, or coordinating hardware-in-the-loop can push runtimes beyond what’s practical on every PR. -2. Cross-repository changes are common. Interface tweaks, coordinated refactorings, or schema migrations need to move in lock-step—even though Git’s default workflows don’t know that. - -We need to bring system-level validation forward without imposing heavy costs on every PR, and to coordinate multi-repo changes as first-class citizens—within a PR-gated workflow. - ---- - -## Goals and Architectural Approach - -We focus on optimizing the existing setup. Effective integration workflows should: - -- Provide early, actionable feedback at the component level -- Reliably and reproducibly test cross-component integration -- Balance test cost with coverage depth -- Scale with pull-request–driven workflows -- Maintain traceability and visibility into what was tested, when, and why - -A central integration repository (assumed present) handles: - -- Defining participating components -- Holding integration test configuration -- Triggering tests for explicit version combinations -- Recording/approving validated sets for downstream use - -Benefits (realized when disciplined) include: - -- Separation of concerns: the integration repository contains no application code, focusing solely on orchestration -- Efficient CI pipeline design: component pipelines are distinct from cross-component integration pipelines, reducing redundant CI overhead -- Consistent governance: updates must pass defined quality checks before acceptance, preserving system integrity without impeding local agility -- Independent component repositories: each component evolves in its own repository, with isolated development and CI -- Minimal overhead: component repositories remain lightweight, free from unnecessary shared tooling -- Improved troubleshooting: failures can be isolated to individual components or 
integration logic, expediting root cause analysis - ---- - -## Integration Workflows - -### Pre-Merge Testing (Pull Requests) - -When a PR is opened or updated in a component repository, two parallel workflows are triggered: - -- Fast, component-specific tests (unit and component-level integration) run in the component’s CI pipeline. -- A system-level integration workflow in the integration repository validates compatibility with the rest of the system, typically running a fast subset of the integration test suite. - -The integration repository fetches the PR branch from the component under test and combines it with the latest main branches (or last known-good versions) of other components to form a synthetic system configuration. This configuration is then built and tested. The workflow may run in parallel with component CI (favoring rapid feedback) or sequentially (minimizing CI load), depending on project constraints. - ---- - -### Pre-Merge Testing of Cross-Repository Dependent Changes - -When changes in one component necessitate coordinated updates in others, the integration repository enables testing these combinations together. Related PRs across repositories are grouped, and the integration repository constructs a configuration using the relevant branches. Run the same fast subset as for single-PR validation and report a unified status back to each PR. - -Two conventions help: - -- Group related PRs via metadata (titles, labels, or an explicit manifest) so the integration repo can discover them -- Resolve branch selection deterministically (e.g., PR branch overrides main for listed components; others stick to last known-good) - -This turns ad-hoc coordination into a normal operation. It reduces the risk that “the last repo to merge” breaks the system because you tested the change set as a unit before anything merges. 
- ---- - -### Post-Merge Integration Validation - -After a PR is merged, the integration repository runs a fuller integration suite using the updated state. Some teams run this on every main-branch commit; others batch changes and run on a timer. Whatever the cadence, the goal is to run a deeper suite than the pre-merge subset and to record the exact component versions that passed. - -Two common patterns: - -- Always-on verification: run after every merge. Failures are easy to attribute but costs are higher. -- Scheduled verification: run on a timer. Costs are lower; root cause analysis is harder. Pair this with bisect automation to identify the offending change when failures occur. - -Successful post-merge tests confirm system stability, and the exact version tuple is recorded for future reference. This decouples verification from release, allowing components and the integrated system to be released independently as needed. - ---- - -### Conclusion - -Integrating distributed, component-based systems in a PR-driven workflow demands disciplined orchestration. Keep most checks close to the code. Use a central integration repository to assemble realistic compositions, run a fast subset pre-merge, and verify deeply post-merge. Record exactly what passed. Treat coordinated changes as first-class. Over time, you’ll get what you need: quick PR feedback and confidence that the system still works when parts move. - -Releases can happen independently of integration, on any verified commit on the main branch. - ---- - -## Considered Alternatives - -### SemVer per Component - -Each component could adopt Semantic Versioning (SemVer) independently, allowing for more granular control over versioning and dependencies. However in the end we want to verify main branches, and not tagged commits. Tagging every commit with a version number would be a rather silly replacement of git hashes. - ---- - -## Realization in GitHub (Planned) - -How to implement the above patterns on GitHub. 
-(According to our current knowledge. We have not done so yet.)
-
-*Examples use Bazel (S-CORE), but the workflow patterns are tool-agnostic.*
-
----
-
-### Pre-Merge Testing (Pull Requests)
-
-Two modes: (A) local consumer tests (provider-driven), (B) integration tests (integration-driven).
-
-#### A. Consumer Injection
-Use when a repo has a well-known set of representative consumers.
-
-1. Clone the consumers repository.
-2. Replace the dependency to the current repository with a dependency to the PR version (*for bazel that's appending `git_override` to `MODULE.bazel`*).
-3. Run the relevant consumer target to verify this "small-scoped-integration".
+### Post-Merge Full Suite
+After merges we run a deeper suite. Some teams trigger on every push to main; others run on a schedule (for example hourly, or every two hours when resources are constrained). Per-merge runs localise failures but cost more; batched runs save resources but expand the search space when problems appear. When the suite fails, retaining the manifest lets you bisect between the last known-good tuple and the current manifest (using a scripted search across the changed SHAs if multiple components advanced). On success we append a record for the tuple with a manifest hash and timing data.
+
+### Manifests
+Manifests are minimal documents describing the composition. They allow reconstruction of the integrated system later.
+Single pull request example: +``` +pr: 482 +component_under_test: + name: docs-as-code + repo: eclipse-score/docs-as-code + sha: 6bc901f2 +others: + - name: component-a + repo: eclipse-score/component-a + ref: main + - name: component-b + repo: eclipse-score/component-b + ref: main +subset: pr_fast +timestamp: 2025-08-13T12:14:03Z +``` +Coordinated example: ``` -git_override( - module_name = "module_under_test", - remote = "{gh_url}", - commit = "{git_pr_hash}" -) +components: + - name: users-service + repo: eclipse-score/users-service + branch: feature/new_email_index + pr: 16 + - name: auth-service + repo: eclipse-score/auth-service + branch: feature/lenient-token-parser + pr: 150 +others: + - name: billing-service + repo: eclipse-score/billing-service + ref: last_stable +subset: pr_fast +changeset: feature-x ``` -#### B. Automated integration workflow -A pull_request in component repos triggers a `repository_dispatch` / `workflow_call` to the integration repo. +Large configuration belongs elsewhere; manifests should stay readable and diffable. 
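
Later sections reference manifests by hash. One hedged way to derive such a hash is canonical serialization plus SHA-256; the exact scheme below is an assumption for illustration, not something this article prescribes.

```
# Sketch: derive a stable content hash for a manifest so runs and
# known-good records can reference it succinctly. Canonical JSON
# (sorted keys, fixed separators) makes the hash independent of key order.
import hashlib
import json

def manifest_hash(manifest: dict) -> str:
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

a = {"pr": 482, "subset": "pr_fast", "sha": "6bc901f2"}
b = {"sha": "6bc901f2", "pr": 482, "subset": "pr_fast"}  # same content, reordered
assert manifest_hash(a) == manifest_hash(b)
print(manifest_hash(a)[:12])  # short form, e.g. for status messages
```

Hashing the canonical form rather than the on-disk YAML means cosmetic edits (key order, whitespace) do not produce a "new" manifest.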
-High-level GitHub Actions outline (component repo side): +## GitHub Realisation +*Conceptual outline; not yet implemented here.* + +Trigger from a component repository: ``` name: integration-pr on: [pull_request] @@ -201,12 +101,10 @@ jobs: repository: eclipse-score/reference_integration event-type: pr-integration client-payload: >- - {"repo":"${{ github.repository }}", - "pr": "${{ github.event.pull_request.number }}", - "sha":"${{ github.sha }}"} + {"repo":"${{ github.repository }}","pr":"${{ github.event.pull_request.number }}","sha":"${{ github.sha }}"} ``` -Integration repo receiving workflow (simplified): +Integration repository receiver (subset): ``` on: repository_dispatch: @@ -217,12 +115,11 @@ jobs: steps: - uses: actions/checkout@v4 - name: Parse payload - run: | - echo '${{ toJson(github.event.client_payload) }}' > payload.json + run: echo '${{ toJson(github.event.client_payload) }}' > payload.json - name: Materialize composition run: python scripts/gen_pr_manifest.py payload.json manifest.pr.yaml - name: Fetch component under test - run: python scripts/fetch_component.py manifest.pr.yaml # clones repo@PR SHA + run: python scripts/fetch_component.py manifest.pr.yaml - name: Render MODULE overrides run: python scripts/render_overrides.py manifest.pr.yaml MODULE.override.bzl - name: Bazel test (subset) @@ -236,67 +133,7 @@ jobs: bazel-testlogs/**/test.log ``` -Manifest (example) written by `gen_pr_manifest.py`: -``` -pr: 482 -component_under_test: - name: docs-as-code - repo: eclipse-score/docs-as-code - sha: 6bc901f2 -others: - - name: component-a - repo: eclipse-score/component-a - ref: main - - name: component-b - repo: eclipse-score/component-b - ref: main -subset: pr_fast -timestamp: 2025-08-13T12:14:03Z -``` - - ---- - -### Pre-Merge Testing of Cross-Repository Dependent Changes - -Coordination mechanism: a changeset label (e.g. `changeset:feature-x`) applied to each involved PR. 
- -Automated discovery (label mode): integration workflow queries GitHub search API for open PRs with the same `changeset:` label across allowed repositories, then builds a manifest analogous to the single-PR manifest but with multiple `overrides` entries. - -Declarative manifest example (`changesets/feature-x`): -``` -components: - - name: users-service - repo: eclipse-score/users-service - branch: feature/new_email_index - pr: 16 - - name: auth-service - repo: eclipse-score/auth-service - branch: feature/lenient-token-parser - pr: 150 -others: - - name: billing-service - repo: eclipse-score/billing-service - ref: last_stable -subset: pr_fast -changeset: feature-x -``` - -Workflow differences vs single PR: -- Replace multiple dependencies -- Post unified status back to each PR (via a bot comment or commit status) summarizing subset result and manifest hash. - -Status semantics: all involved PRs blocked until this coordinated subset passes. - ---- - -### Post-Merge Integration Validation - -Trigger: push to `main` in any component repo OR scheduled (cron) in integration repo pulling latest heads. Two modes: -1. Per-merge: repository_dispatch from component merge workflow -2. Scheduled batch: hourly cron that refreshes each component repository SHA - -Workflow outline (full suite): +Post-merge full suite: ``` on: schedule: [{cron: "15 * * * *"}] @@ -324,11 +161,8 @@ jobs: bazel-testlogs/**/test.log ``` -Persistence strategies: -- Commit updated `known_good/index.json` (requires a bot token) containing an array of tuples with timestamp + SHAs + manifest hash. -- Or publish a release/tag referencing the manifest artifact (immutable evidence). - -Known-good record snippet: +### Recording Known-Good Tuples +Known-good records are stored append-only. 
``` [ { @@ -344,22 +178,27 @@ Known-good record snippet: } ] ``` +Persisting enables reproduction (attach manifest to a defect), audit (what exactly passed before a release), gating (choose any known-good tuple), and comparison (diff manifests to isolate drift) without relying on (rather fragile) links to unique runs in your CI system. + +## Curating the Fast Subset +Tests in the subset need to fail quickly when public seams change. Keep the list explicit (an alias such as //integration/subset:pr_fast). Remove redundant tests and quarantine flaky ones; otherwise the feedback loop becomes noisy or slow. Review the subset periodically (for example monthly or after significant interface churn) to keep its signal-to-noise high. -On failure: attach failing manifest + summarized failing targets; optionally open (or update) a rolling issue keyed by manifest hash to avoid alert fatigue. +## Handling Failures +For a failing pull request subset: inspect the manifest and failing log; reproduce locally with a script that consumes the manifest. For a failing coordinated set: treat all participating pull requests as a unit and address seam failures before merging any. For a failing post-merge full suite: bisect between the last known-good tuple and the current manifest (script permutations when more than one repository changed) to narrow the cause. Distinguish between a genuine regression and test fragility so you do not mask product issues by disabling tests. ---- +## Trade-offs and Choices +Using manifests and commit SHAs instead of assigning semantic versions to every commit keeps validation close to current heads without creating tag noise. A two-tier arrangement (subset and full) offers a clear mental model; more tiers can be added later if evidence supports them. A central orchestration repository centralises caching and secrets handling and keeps audit history straightforward. -### Considerations +## Practical Notes +Cache builds to stabilise subset runtime. 
Hash manifests (e.g. SHA-256) to reference runs succinctly. Provide an endpoint or badge showing the most recent known-good tuple. Generate overrides rather than editing them manually. Optionally lint the subset target to ensure only approved directories are referenced. -- Use caching to keep PR subset times predictable (bazel, ccache, etc.) -- Tag slow or flaky tests; exclude from `pr_fast`. -- Keep the subset target as an explicit target group (e.g. `pr_fast` alias) rather than relying on pattern globs—makes curation auditable via review. +## Avoiding Common Pitfalls +Selecting tests dynamically from a diff often misses schema or contract drift. Editing integration configuration manually for individual pull requests produces runs that cannot be reproduced. Relying on merge order to coordinate a multi-repository change delays detection to the last merger. ---- +## Signs It Is Working +An interface change that would break another repository fails in the subset run before merge. A coordinated schema change shows a unified status across all related pull requests. A regression introduced over several independent merges is detected by the full suite and localised quickly using stored manifests. -### Failure Triage Flow (Recommended) -1. PR subset fails: developer inspects manifest + specific seam test log; reproduce locally with `reproduce.sh manifest.pr.yaml`. -2. Coordinated set fails: manual investigation of all involved PRs and their logs. -3. Post-merge fails: bisect between last known‑good and current HEAD across components and component SHAs (scripted: iterate manifest permutations if necessary) then open focused issue. +## Summary +By expressing the integrated system as explicit manifests, curating a fast integration subset for pull requests, and running a deeper post-merge suite, you move discovery of cross-repository breakage earlier while keeping costs predictable. 
Each successful run leaves a reproducible record, making release selection and debugging straightforward. The approach lets a distributed codebase behave operationally like a single one. ---- +*Further reading:* Continuous Integration (Fowler), Continuous Delivery (Humble & Farley), trunk-based development resources. From b97d08318379b19feec0d417b90f86c5a3cff35a Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 17:15:53 +0200 Subject: [PATCH 06/16] rfc comment --- integration.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/integration.md b/integration.md index a3161b46..1dd8cda9 100644 --- a/integration.md +++ b/integration.md @@ -1,5 +1,7 @@ # Integration Testing in a Distributed Monolith +*RFC – Working draft. High-level overview of how our projects typically integrate (reflecting practices used in several codebases). Assumptions and trade-offs are noted; please flag gaps or over-complication so we can iterate. Easiest way for feedback is face to face!* + Teams often split what is functionally a single system across many repositories. Each repository can show a green build while the assembled system is already broken. This article looks at how to bring system-level feedback earlier when you work that way. The context here assumes three things: you develop through pull requests with required checks; you have multiple interdependent repositories that ship together; and you either have or will create a central integration repository used only for orchestration. If any of those are absent you will need to establish them first; the rest of the discussion builds on them. 
From aa8034a6a9eaa64d5737d83824af99e9aae687f1 Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 17:18:32 +0200
Subject: [PATCH 07/16] simplify out of scope section

---
 integration.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/integration.md b/integration.md
index 1dd8cda9..668cde92 100644
--- a/integration.md
+++ b/integration.md
@@ -2,7 +2,7 @@

 *RFC – Working draft. High-level overview of how our projects typically integrate (reflecting practices used in several codebases). Assumptions and trade-offs are noted; please flag gaps or over-complication so we can iterate. Easiest way for feedback is face to face!*

-Teams often split what is functionally a single system across many repositories. Each repository can show a green build while the assembled system is already broken. This article looks at how to bring system-level feedback earlier when you work that way.
+Teams often split what is functionally a single system across many repositories. Each repository can show a green build while the assembled system is already broken. This article looks at how to bring system-level feedback earlier when you work that way. This article does not argue for pull requests, trunk-based development, or continuous integration itself. Those are well covered elsewhere. It also does not look into any specific tools or implementations for achieving these practices - except for providing a GitHub-based example.

 The context here assumes three things: you develop through pull requests with required checks; you have multiple interdependent repositories that ship together; and you either have or will create a central integration repository used only for orchestration. If any of those are absent you will need to establish them first; the rest of the discussion builds on them.
@@ -20,8 +20,6 @@ Fast subset – a curated group of integration tests chosen to finish in single- Tuple – the mapping of component names to their commit SHAs for one integrated build; e.g. { users: a1c3f9d, billing: 9e02b4c }. Known good – a tuple plus metadata (timestamp, suite, manifest hash) that passed a defined suite and is stored for later reproduction. -## Out of Scope -This piece does not argue for pull requests, trunk-based development, or continuous integration itself. Those are well covered elsewhere. It also does not look into any specific tools or implementations for achieving these practices. ## A Note on History Classic continuous integration advice assumed a single codebase. Splitting a cohesive system across repositories reintroduces many of the coordination issues CI was meant to remove. The approach here adapts familiar CI principles (frequent integration, fast feedback, reproducibility) to a multi-repository boundary. From 8d53bb12eb5bffe252f4980dcb32c308de652e0a Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 17:20:22 +0200 Subject: [PATCH 08/16] improve rfc wording --- integration.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/integration.md b/integration.md index 668cde92..751aa7cc 100644 --- a/integration.md +++ b/integration.md @@ -1,6 +1,6 @@ # Integration Testing in a Distributed Monolith -*RFC – Working draft. High-level overview of how our projects typically integrate (reflecting practices used in several codebases). Assumptions and trade-offs are noted; please flag gaps or over-complication so we can iterate. Easiest way for feedback is face to face!* +*RFC – Working draft. High-level overview of how our projects typically integrate (reflecting practices used in several codebases). Assumptions and trade-offs are noted; please flag gaps or disagreeable sections so we can iterate. 
Easiest way for feedback is face to face!* Teams often split what is functionally a single system across many repositories. Each repository can show a green build while the assembled system is already broken. This article looks at how to bring system-level feedback earlier when you work that way. This article does not argue for pull requests, trunk-based development, or continuous integration itself. Those are well covered elsewhere. It also does not look into any specific tools or implementations for achieving these practices - except for providing a GitHub based example. From 632addbd41cb87c694605b7e93a3b9073fd3d348 Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 17:23:54 +0200 Subject: [PATCH 09/16] merge some chapters --- integration.md | 44 +++++++++++++++++--------------------------- 1 file changed, 17 insertions(+), 27 deletions(-) diff --git a/integration.md b/integration.md index 751aa7cc..cfa3bd5a 100644 --- a/integration.md +++ b/integration.md @@ -11,23 +11,18 @@ An interface change (for example a renamed field in a shared schema) is updated Running full end-to-end environments on every pull request is rarely affordable. Coordinated multi-repository changes are then handled informally through ad-hoc ordering: “merge yours after mine”. Late detection raises cost and makes regression origins harder to locate. -## Core Ideas -We model the integrated system as an explicit set of (component, commit) pairs captured in a manifest. We derive those manifests deterministically from events: a single pull request, a coordinated group of pull requests, or a post-merge refresh. We run a curated fast subset of integration tests for pre-merge feedback and a deeper suite after merge. When a suite passes we record the manifest (“known good”). Coordinated multi-repository changes are treated as a first-class case so they are validated as a unit rather than through merge ordering. 
+## Core Concepts +We model the integrated system as an explicit set of (component, commit) pairs captured in a manifest. Manifests are derived deterministically from events: a single pull request, a coordinated group of pull requests, or a post-merge refresh. A curated fast subset of integration tests provides pre-merge feedback; a deeper suite runs after merge. Passing suites produce a recorded manifest (“known good”). Coordinated multi-repository change is treated as a first-class case—we validate the set as a unit rather than relying on merge ordering. -## Terminology -Component – a repository that participates in the assembled product (for example a service API repo or a common library). -Fast subset – a curated group of integration tests chosen to finish in single-digit minutes; for example tests that exercise protocol seams or migration boundaries. -Tuple – the mapping of component names to their commit SHAs for one integrated build; e.g. { users: a1c3f9d, billing: 9e02b4c }. -Known good – a tuple plus metadata (timestamp, suite, manifest hash) that passed a defined suite and is stored for later reproduction. +Terminology (brief): +* Component – repository that participates in the assembled product (e.g. service API repo, shared library). +* Fast subset – curated integration tests finishing in single-digit minutes (protocol seams, migration boundaries, adapters). +* Tuple – mapping of component names to commit SHAs for one integrated build (e.g. { users: a1c3f9d, billing: 9e02b4c }). +* Known good – tuple + metadata (timestamp, suite, manifest hash) stored for later reproduction. +History & context: classic continuous integration assumed a single codebase; splitting one system across repositories reintroduces coordination issues CI was intended to remove. This adapts familiar CI principles (frequent integration, fast feedback, reproducibility) to a multi-repository boundary. 
The central integration repository is a neutral place to define participating components, build manifests, hold integration-specific helpers (overrides, fixtures, seam tests), and persist known-good records. It should not contain business logic; keeping it lean reduces accidental coupling and simplifies review. -## A Note on History -Classic continuous integration advice assumed a single codebase. Splitting a cohesive system across repositories reintroduces many of the coordination issues CI was meant to remove. The approach here adapts familiar CI principles (frequent integration, fast feedback, reproducibility) to a multi-repository boundary. - -## Why Use a Central Integration Repository -A central repository offers a neutral place to define which components participate, to build manifests from events, to hold integration‑specific helpers (overrides, fixtures, seam tests), and to persist records of successful tuples. It should not contain business code. Keeping it small keeps review focused and reduces accidental coupling. - -## Workflow Layers +## Integration Workflows We use three recurring workflows: a single pull request, a coordinated subset when multiple pull requests must land together, and a post‑merge fuller suite. Each produces a manifest, runs an appropriate depth of tests, and may record the tuple if successful. ### Single Pull Request @@ -83,7 +78,7 @@ changeset: feature-x Large configuration belongs elsewhere; manifests should stay readable and diffable. -## GitHub Realisation +## Example: GitHub Actions (Conceptual) *Conceptual outline; not yet implemented here.* Trigger from a component repository: @@ -180,23 +175,18 @@ Known-good records are stored append-only. ``` Persisting enables reproduction (attach manifest to a defect), audit (what exactly passed before a release), gating (choose any known-good tuple), and comparison (diff manifests to isolate drift) without relying on (rather fragile) links to unique runs in your CI system. 
-## Curating the Fast Subset -Tests in the subset need to fail quickly when public seams change. Keep the list explicit (an alias such as //integration/subset:pr_fast). Remove redundant tests and quarantine flaky ones; otherwise the feedback loop becomes noisy or slow. Review the subset periodically (for example monthly or after significant interface churn) to keep its signal-to-noise high. +## Operating It +**Curating the fast subset:** Tests should fail quickly when public seams change. Keep the list explicit (e.g. //integration/subset:pr_fast). Remove redundant tests and quarantine flaky ones; review periodically (monthly or after significant interface churn) to preserve signal. -## Handling Failures -For a failing pull request subset: inspect the manifest and failing log; reproduce locally with a script that consumes the manifest. For a failing coordinated set: treat all participating pull requests as a unit and address seam failures before merging any. For a failing post-merge full suite: bisect between the last known-good tuple and the current manifest (script permutations when more than one repository changed) to narrow the cause. Distinguish between a genuine regression and test fragility so you do not mask product issues by disabling tests. +**Handling failures:** For a failing pull request subset: inspect manifest + log; reproduce locally with a script consuming the manifest. For a failing coordinated set: treat all related pull requests as atomic. For a failing post-merge full suite: bisect between the last known-good tuple and current manifest (script permutations if multiple repositories changed) to narrow cause. Distinguish real regressions from test fragility. -## Trade-offs and Choices -Using manifests and commit SHAs instead of assigning semantic versions to every commit keeps validation close to current heads without creating tag noise. 
A two-tier arrangement (subset and full) offers a clear mental model; more tiers can be added later if evidence supports them. A central orchestration repository centralises caching and secrets handling and keeps audit history straightforward. +**Trade-offs and choices:** Manifests + SHAs avoid tag noise and keep validation close to heads. Two tiers (subset + full) offer a clear mental model; add more only with evidence. A central orchestration repository centralises caching, secrets, and audit history. -## Practical Notes -Cache builds to stabilise subset runtime. Hash manifests (e.g. SHA-256) to reference runs succinctly. Provide an endpoint or badge showing the most recent known-good tuple. Generate overrides rather than editing them manually. Optionally lint the subset target to ensure only approved directories are referenced. +**Practical notes:** Cache builds to stabilise subset runtime. Hash manifests (e.g. SHA-256) for concise references. Expose an endpoint or badge showing the latest known good. Generate overrides; do not hand-edit ephemeral files. Optionally lint the subset target for allowed directories. -## Avoiding Common Pitfalls -Selecting tests dynamically from a diff often misses schema or contract drift. Editing integration configuration manually for individual pull requests produces runs that cannot be reproduced. Relying on merge order to coordinate a multi-repository change delays detection to the last merger. +**Avoiding pitfalls:** Diff-based dynamic test selection often misses schema or contract drift. Ad-hoc manual edits to integration config reduce reproducibility. Merge ordering as coordination defers detection to the last merge. -## Signs It Is Working -An interface change that would break another repository fails in the subset run before merge. A coordinated schema change shows a unified status across all related pull requests. 
A regression introduced over several independent merges is detected by the full suite and localised quickly using stored manifests. +**Signs it is working:** Interface breakage is caught pre-merge. Coordinated change sets show unified status. Multi-repository regressions are localised rapidly using stored manifests. ## Summary By expressing the integrated system as explicit manifests, curating a fast integration subset for pull requests, and running a deeper post-merge suite, you move discovery of cross-repository breakage earlier while keeping costs predictable. Each successful run leaves a reproducible record, making release selection and debugging straightforward. The approach lets a distributed codebase behave operationally like a single one. From c540786579e7f2a29dd143b8bac86a17607ab129 Mon Sep 17 00:00:00 2001 From: Alexander Lanin Date: Wed, 13 Aug 2025 17:39:40 +0200 Subject: [PATCH 10/16] auto format --- integration.md | 137 +++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 111 insertions(+), 26 deletions(-) diff --git a/integration.md b/integration.md index cfa3bd5a..7ce76556 100644 --- a/integration.md +++ b/integration.md @@ -1,43 +1,103 @@ # Integration Testing in a Distributed Monolith -*RFC – Working draft. High-level overview of how our projects typically integrate (reflecting practices used in several codebases). Assumptions and trade-offs are noted; please flag gaps or disagreeable sections so we can iterate. Easiest way for feedback is face to face!* +*RFC – Working draft. High-level overview of how our projects typically integrate +(reflecting practices used in several codebases). Assumptions and trade-offs are noted; +please flag gaps or disagreeable sections so we can iterate. Easiest way for feedback is +face to face!* -Teams often split what is functionally a single system across many repositories. Each repository can show a green build while the assembled system is already broken. 
This article looks at how to bring system-level feedback earlier when you work that way. This article does not argue for pull requests, trunk-based development, or continuous integration itself. Those are well covered elsewhere. It also does not look into any specific tools or implementations for achieving these practices - except for providing a GitHub based example. +Teams often split what is functionally a single system across many repositories. Each +repository can show a green build while the assembled system is already broken. This +article looks at how to bring system-level feedback earlier when you work that way. This +article does not argue for pull requests, trunk-based development, or continuous +integration itself. Those are well covered elsewhere. It also does not look into any +specific tools or implementations for achieving these practices - except for providing a +GitHub based example. -The context here assumes three things: you develop through pull requests with required checks; you have multiple interdependent repositories that ship together; and you either have or will create a central integration repository used only for orchestration. If any of those are absent you will need to establish them first; the rest of the discussion builds on them. +The context here assumes three things: you develop through pull requests with required +checks; you have multiple interdependent repositories that ship together; and you either +have or will create a central integration repository used only for orchestration. If any +of those are absent you will need to establish them first; the rest of the discussion +builds on them. ## Where Problems Usually Appear -An interface change (for example a renamed field in a shared schema) is updated in two direct consumers. Their pull requests pass. Another consumer several repositories away still depends on the old interface and only fails once the whole set of changes reaches main and a later integration run executes. 
The defect was present early but only visible late. Investigation now needs cross-repo log hunting instead of a quick fix while the change was still in flight. +An interface change (for example a renamed field in a shared schema) is updated in two +direct consumers. Their pull requests pass. Another consumer several repositories away +still depends on the old interface and only fails once the whole set of changes reaches +main and a later integration run executes. The defect was present early but only visible +late. Investigation now needs cross-repo log hunting instead of a quick fix while the +change was still in flight. -Running full end-to-end environments on every pull request is rarely affordable. Coordinated multi-repository changes are then handled informally through ad-hoc ordering: “merge yours after mine”. Late detection raises cost and makes regression origins harder to locate. +Running full end-to-end environments on every pull request is rarely affordable. +Coordinated multi-repository changes are then handled informally through ad-hoc +ordering: “merge yours after mine”. Late detection raises cost and makes regression +origins harder to locate. ## Core Concepts -We model the integrated system as an explicit set of (component, commit) pairs captured in a manifest. Manifests are derived deterministically from events: a single pull request, a coordinated group of pull requests, or a post-merge refresh. A curated fast subset of integration tests provides pre-merge feedback; a deeper suite runs after merge. Passing suites produce a recorded manifest (“known good”). Coordinated multi-repository change is treated as a first-class case—we validate the set as a unit rather than relying on merge ordering. +We model the integrated system as an explicit set of (component, commit) pairs captured +in a manifest. Manifests are derived deterministically from events: a single pull +request, a coordinated group of pull requests, or a post-merge refresh. 
A curated fast +subset of integration tests provides pre-merge feedback; a deeper suite runs after +merge. Passing suites produce a recorded manifest (“known good”). Coordinated +multi-repository change is treated as a first-class case—we validate the set as a unit +rather than relying on merge ordering. Terminology (brief): -* Component – repository that participates in the assembled product (e.g. service API repo, shared library). -* Fast subset – curated integration tests finishing in single-digit minutes (protocol seams, migration boundaries, adapters). -* Tuple – mapping of component names to commit SHAs for one integrated build (e.g. { users: a1c3f9d, billing: 9e02b4c }). -* Known good – tuple + metadata (timestamp, suite, manifest hash) stored for later reproduction. +* Component – repository that participates in the assembled product (e.g. service API + repo, shared library). +* Fast subset – curated integration tests finishing in single-digit minutes (protocol + seams, migration boundaries, adapters). +* Tuple – mapping of component names to commit SHAs for one integrated build (e.g. { + users: a1c3f9d, billing: 9e02b4c }). +* Known good – tuple + metadata (timestamp, suite, manifest hash) stored for later + reproduction. -History & context: classic continuous integration assumed a single codebase; splitting one system across repositories reintroduces coordination issues CI was intended to remove. This adapts familiar CI principles (frequent integration, fast feedback, reproducibility) to a multi-repository boundary. The central integration repository is a neutral place to define participating components, build manifests, hold integration-specific helpers (overrides, fixtures, seam tests), and persist known-good records. It should not contain business logic; keeping it lean reduces accidental coupling and simplifies review. 
+History & context: classic continuous integration assumed a single codebase; splitting +one system across repositories reintroduces coordination issues CI was intended to +remove. This adapts familiar CI principles (frequent integration, fast feedback, +reproducibility) to a multi-repository boundary. The central integration repository is a +neutral place to define participating components, build manifests, hold +integration-specific helpers (overrides, fixtures, seam tests), and persist known-good +records. It should not contain business logic; keeping it lean reduces accidental +coupling and simplifies review. ## Integration Workflows -We use three recurring workflows: a single pull request, a coordinated subset when multiple pull requests must land together, and a post‑merge fuller suite. Each produces a manifest, runs an appropriate depth of tests, and may record the tuple if successful. +We use three recurring workflows: a single pull request, a coordinated subset when +multiple pull requests must land together, and a post‑merge fuller suite. Each produces +a manifest, runs an appropriate depth of tests, and may record the tuple if successful. ### Single Pull Request -When a pull request opens or updates, its repository runs its normal fast tests. The integration repository is also triggered with the repository name, pull request number, and head SHA. It builds a manifest using that SHA for the changed component and the last known-good (or main) SHAs for others, then runs the curated fast subset. The result is reported back to the pull request. The manifest and logs are stored even when failing so a developer can reproduce locally. +When a pull request opens or updates, its repository runs its normal fast tests. The +integration repository is also triggered with the repository name, pull request number, +and head SHA. 
It builds a manifest using that SHA for the changed component and the last +known-good (or main) SHAs for others, then runs the curated fast subset. The result is +reported back to the pull request. The manifest and logs are stored even when failing so +a developer can reproduce locally. -The subset is explicit rather than dynamically inferred. Tests in it should fail quickly when contracts or shared schemas drift. If the list grows until it is slow it will either be disabled or ignored; regular curation keeps it useful. +The subset is explicit rather than dynamically inferred. Tests in it should fail quickly +when contracts or shared schemas drift. If the list grows until it is slow it will +either be disabled or ignored; regular curation keeps it useful. ### Coordinated Multi-Repository Subset -Some changes require multiple repositories to move together (for example a schema evolution, a cross-cutting refactor, a protocol tightening). We mark related pull requests using a stable mechanism such as a common label (e.g. changeset:feature-x). The integration workflow discovers all open pull requests sharing the label, builds a manifest from their head SHAs, and runs the same fast subset. A unified status is posted back to each pull request. None merge until the coordinated set is green. This removes informal merge ordering as a coordination mechanism. +Some changes require multiple repositories to move together (for example a schema +evolution, a cross-cutting refactor, a protocol tightening). We mark related pull +requests using a stable mechanism such as a common label (e.g. changeset:feature-x). The +integration workflow discovers all open pull requests sharing the label, builds a +manifest from their head SHAs, and runs the same fast subset. A unified status is posted +back to each pull request. None merge until the coordinated set is green. This removes +informal merge ordering as a coordination mechanism. 
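Building the coordinated manifest amounts to overlaying the labelled pull-request heads on the last known-good baseline. A rough sketch with made-up data (a real workflow would discover the labelled pull requests through the hosting platform's API; all names and SHAs below are illustrative):

```python
def coordinated_manifest(baseline: dict, labelled_prs: list[dict], changeset: str) -> dict:
    """Overlay the head SHAs of a coordinated pull-request set on the
    last known-good baseline tuple; untouched components keep their SHAs."""
    tuple_ = dict(baseline)  # copy so the baseline record stays immutable
    for pr in labelled_prs:
        tuple_[pr["component"]] = pr["head_sha"]
    return {"changeset": changeset, "subset": "pr_fast", "tuple": tuple_}

baseline = {
    "users-service": "a4fd56e",
    "auth-service": "b7c01aa",
    "billing-service": "c9d22ef",
}
labelled_prs = [
    {"component": "users-service", "head_sha": "a57bcdf", "pr": 16},
    {"component": "auth-service", "head_sha": "9284675", "pr": 150},
]
manifest = coordinated_manifest(baseline, labelled_prs, changeset="feature-x")
```

The same manifest is then posted as a unified status to every participating pull request, so none can merge while any seam test in the set is red.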
### Post-Merge Full Suite -After merges we run a deeper suite. Some teams trigger on every push to main; others run on a schedule (for example hourly). Per-merge runs localise failures but cost more; batched runs save resources but expand the search space when problems appear (for example every two hours when resources are constrained). When the suite fails, retaining the manifest lets you bisect between the last known-good tuple and the current manifest (using a scripted search across the changed SHAs if multiple components advanced). On success we append a record for the tuple with a manifest hash and timing data. +After merges we run a deeper suite. Some teams trigger on every push to main; others run +on a schedule (for example hourly). Per-merge runs localise failures but cost more; +batched runs save resources but expand the search space when problems appear (for +example every two hours when resources are constrained). When the suite fails, retaining +the manifest lets you bisect between the last known-good tuple and the current manifest +(using a scripted search across the changed SHAs if multiple components advanced). On +success we append a record for the tuple with a manifest hash and timing data. ### Manifests -Manifests are minimal documents describing the composition. They allow reconstruction of the integrated system later. +Manifests are minimal documents describing the composition. They allow reconstruction of +the integrated system later. Single pull request example: ``` @@ -173,22 +233,47 @@ Known-good records are stored append-only. } ] ``` -Persisting enables reproduction (attach manifest to a defect), audit (what exactly passed before a release), gating (choose any known-good tuple), and comparison (diff manifests to isolate drift) without relying on (rather fragile) links to unique runs in your CI system. 
+Persisting enables reproduction (attach manifest to a defect), audit (what exactly +passed before a release), gating (choose any known-good tuple), and comparison (diff +manifests to isolate drift) without relying on (rather fragile) links to unique runs in +your CI system. ## Operating It -**Curating the fast subset:** Tests should fail quickly when public seams change. Keep the list explicit (e.g. //integration/subset:pr_fast). Remove redundant tests and quarantine flaky ones; review periodically (monthly or after significant interface churn) to preserve signal. +**Curating the fast subset:** Tests should fail quickly when public seams change. Keep +the list explicit (e.g. //integration/subset:pr_fast). Remove redundant tests and +quarantine flaky ones; review periodically (monthly or after significant interface +churn) to preserve signal. -**Handling failures:** For a failing pull request subset: inspect manifest + log; reproduce locally with a script consuming the manifest. For a failing coordinated set: treat all related pull requests as atomic. For a failing post-merge full suite: bisect between the last known-good tuple and current manifest (script permutations if multiple repositories changed) to narrow cause. Distinguish real regressions from test fragility. +**Handling failures:** For a failing pull request subset: inspect manifest + log; +reproduce locally with a script consuming the manifest. For a failing coordinated set: +treat all related pull requests as atomic. For a failing post-merge full suite: bisect +between the last known-good tuple and current manifest (script permutations if multiple +repositories changed) to narrow cause. Distinguish real regressions from test fragility. -**Trade-offs and choices:** Manifests + SHAs avoid tag noise and keep validation close to heads. Two tiers (subset + full) offer a clear mental model; add more only with evidence. A central orchestration repository centralises caching, secrets, and audit history. 
+**Trade-offs and choices:** Manifests + SHAs avoid tag noise and keep validation close
+to heads. Two tiers (subset + full) offer a clear mental model; add more only with
+evidence. A central orchestration repository centralises caching, secrets, and audit
+history.
 
-**Practical notes:** Cache builds to stabilise subset runtime. Hash manifests (e.g. SHA-256) for concise references. Expose an endpoint or badge showing the latest known good. Generate overrides; do not hand-edit ephemeral files. Optionally lint the subset target for allowed directories.
+**Practical notes:** Cache builds to stabilise subset runtime. Hash manifests (e.g.
+SHA-256) for concise references. Expose an endpoint or badge showing the latest known
+good. Generate overrides; do not hand-edit ephemeral files. Optionally lint the subset
+target for allowed directories.
 
-**Avoiding pitfalls:** Diff-based dynamic test selection often misses schema or contract drift. Ad-hoc manual edits to integration config reduce reproducibility. Merge ordering as coordination defers detection to the last merge.
+**Avoiding pitfalls:** Diff-based dynamic test selection often misses schema or contract
+drift. Ad-hoc manual edits to integration config reduce reproducibility. Merge ordering
+as coordination defers detection to the last merge.
 
-**Signs it is working:** Interface breakage is caught pre-merge. Coordinated change sets show unified status. Multi-repository regressions are localised rapidly using stored manifests.
+**Signs it is working:** Interface breakage is caught pre-merge. Coordinated change sets
+show unified status. Multi-repository regressions are localised rapidly using stored
+manifests.
 
 ## Summary
 
-By expressing the integrated system as explicit manifests, curating a fast integration subset for pull requests, and running a deeper post-merge suite, you move discovery of cross-repository breakage earlier while keeping costs predictable. Each successful run leaves a reproducible record, making release selection and debugging straightforward. The approach lets a distributed codebase behave operationally like a single one.
+By expressing the integrated system as explicit manifests, curating a fast integration
+subset for pull requests, and running a deeper post-merge suite, you move discovery of
+cross-repository breakage earlier while keeping costs predictable. Each successful run
+leaves a reproducible record, making release selection and debugging straightforward.
+The approach lets a distributed codebase behave operationally like a single one.
 
-*Further reading:* Continuous Integration (Fowler), Continuous Delivery (Humble & Farley), trunk-based development resources.
+*Further reading:* Continuous Integration (Fowler), Continuous Delivery (Humble &
+Farley), trunk-based development resources.

From 9978a5de33bba2a52761da89876287a8c8f0dd70 Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 17:55:25 +0200
Subject: [PATCH 11/16] review feedback Max

---
 integration.md | 40 ++++++++++++++++++++++++----------------
 1 file changed, 24 insertions(+), 16 deletions(-)

diff --git a/integration.md b/integration.md
index 7ce76556..2d08119c 100644
--- a/integration.md
+++ b/integration.md
@@ -69,9 +69,9 @@ a manifest, runs an appropriate depth of tests, and may record the tuple if succ
 When a pull request opens or updates, its repository runs its normal fast tests. The
 integration repository is also triggered with the repository name, pull request number,
 and head SHA. It builds a manifest using that SHA for the changed component and the last
-known-good (or main) SHAs for others, then runs the curated fast subset. The result is
-reported back to the pull request. The manifest and logs are stored even when failing so
-a developer can reproduce locally.
+known-good SHAs for others, then runs the curated fast subset. The result is reported
+back to the pull request. The manifest and logs are stored even when failing so a
+developer can reproduce locally.
 
 The subset is explicit rather than dynamically inferred. Tests in it should fail quickly
 when contracts or shared schemas drift. If the list grows until it is slow it will
@@ -109,29 +109,31 @@ component_under_test:
 others:
   - name: component-a
     repo: eclipse-score/component-a
-    ref: main
+    ref: 34985hf8 # based on last known-good
   - name: component-b
     repo: eclipse-score/component-b
-    ref: main
+    ref: a4fd56re # based on last known-good
 subset: pr_fast
 timestamp: 2025-08-13T12:14:03Z
 ```
 
 Coordinated example:
 ```
-components:
+components_under_test:
   - name: users-service
     repo: eclipse-score/users-service
     branch: feature/new_email_index
+    ref: a57hrdfg
     pr: 16
   - name: auth-service
     repo: eclipse-score/auth-service
    branch: feature/lenient-token-parser
+    ref: q928d46b75
     pr: 150
 others:
   - name: billing-service
     repo: eclipse-score/billing-service
-    ref: last_stable
+    ref: a4fd56re # based on last known-good
 subset: pr_fast
 changeset: feature-x
 ```
@@ -171,14 +173,16 @@ jobs:
       - uses: actions/checkout@v4
       - name: Parse payload
         run: echo '${{ toJson(github.event.client_payload) }}' > payload.json
+
       - name: Materialize composition
-        run: python scripts/gen_pr_manifest.py payload.json manifest.pr.yaml
-      - name: Fetch component under test
-        run: python scripts/fetch_component.py manifest.pr.yaml
+        run: gen_pr_manifest.py last_known_good.yaml payload.json > manifest.pr.yaml
+
       - name: Render MODULE overrides
-        run: python scripts/render_overrides.py manifest.pr.yaml MODULE.override.bzl
+        run: render_overrides.py manifest.pr.yaml > MODULE.override.bzl
+
       - name: Bazel test (subset)
         run: bazel test //integration/subset:pr_fast --override_module_files=MODULE.override.bzl
+
       - name: Store manifest & results
         uses: actions/upload-artifact@v4
         with:
@@ -199,20 +203,24 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - uses: actions/checkout@v4
-      - name: Generate full manifest
-        run: python scripts/gen_full_manifest.py manifest.full.yaml
+
+      - name: Generate new last_known_good.yaml
+        run: update_last_known_good.py last_known_good.yaml > last_known_good.yaml
+
       - name: Bazel test (full)
         run: bazel test //integration/full:all --test_tag_filters=-flaky
+
       - name: Persist known-good tuple (on success)
         if: success()
-        run: python scripts/persist_tuple.py manifest.full.yaml known_good/index.json
+        run: |
+          git add last_known_good.yaml
+          git commit -m "update known good"
+
       - name: Upload artifacts
         uses: actions/upload-artifact@v4
         with:
           name: full-${{ github.run_id }}
           path: |
-            manifest.full.yaml
-            known_good/index.json
             bazel-testlogs/**/test.log
 ```

From 105430f3bf266b80983ed710d40f152b76665cd2 Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 22:02:09 +0200
Subject: [PATCH 12/16] fix special char

---
 integration.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/integration.md b/integration.md
index 2d08119c..58f81fd3 100644
--- a/integration.md
+++ b/integration.md
@@ -1,6 +1,6 @@
 # Integration Testing in a Distributed Monolith
 
-*RFC – Working draft. High-level overview of how our projects typically integrate
+*RFC - Working draft. High-level overview of how our projects typically integrate
 (reflecting practices used in several codebases). Assumptions and trade-offs are noted;
 please flag gaps or disagreeable sections so we can iterate. Easiest way for feedback is
 face to face!*
@@ -42,13 +42,13 @@ multi-repository change is treated as a first-class case—we validate the set a
 rather than relying on merge ordering.
 
 Terminology (brief):
-* Component – repository that participates in the assembled product (e.g. service API
+* Component - repository that participates in the assembled product (e.g. service API
 repo, shared library).
-* Fast subset – curated integration tests finishing in single-digit minutes (protocol
+* Fast subset - curated integration tests finishing in single-digit minutes (protocol
 seams, migration boundaries, adapters).
-* Tuple – mapping of component names to commit SHAs for one integrated build (e.g. {
+* Tuple - mapping of component names to commit SHAs for one integrated build (e.g. {
 users: a1c3f9d, billing: 9e02b4c }).
-* Known good – tuple + metadata (timestamp, suite, manifest hash) stored for later
+* Known good - tuple + metadata (timestamp, suite, manifest hash) stored for later
 reproduction.
 
 History & context: classic continuous integration assumed a single codebase; splitting

From 3050bcebf99e84f690ec72dd8a78a8fb665e60da Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 23:05:35 +0200
Subject: [PATCH 13/16] add image

---
 integration.md | 64 ++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/integration.md b/integration.md
index 58f81fd3..f741918c 100644
--- a/integration.md
+++ b/integration.md
@@ -1,6 +1,6 @@
 # Integration Testing in a Distributed Monolith
 
-*RFC - Working draft. High-level overview of how our projects typically integrate
+*RFC – Working draft. High-level overview of how our projects typically integrate
 (reflecting practices used in several codebases). Assumptions and trade-offs are noted;
 please flag gaps or disagreeable sections so we can iterate. Easiest way for feedback is
 face to face!*
@@ -42,13 +42,13 @@ multi-repository change is treated as a first-class case—we validate the set a
 rather than relying on merge ordering.
 
 Terminology (brief):
-* Component - repository that participates in the assembled product (e.g. service API
+* Component – repository that participates in the assembled product (e.g. service API
 repo, shared library).
-* Fast subset - curated integration tests finishing in single-digit minutes (protocol
+* Fast subset – curated integration tests finishing in single-digit minutes (protocol
 seams, migration boundaries, adapters).
-* Tuple - mapping of component names to commit SHAs for one integrated build (e.g. {
+* Tuple – mapping of component names to commit SHAs for one integrated build (e.g. {
 users: a1c3f9d, billing: 9e02b4c }).
-* Known good - tuple + metadata (timestamp, suite, manifest hash) stored for later
+* Known good – tuple + metadata (timestamp, suite, manifest hash) stored for later
 reproduction.
 
 History & context: classic continuous integration assumed a single codebase; splitting
@@ -62,9 +62,49 @@ coupling and simplifies review.
 ## Integration Workflows
 
 We use three recurring workflows: a single pull request, a coordinated subset when
-multiple pull requests must land together, and a post‑merge fuller suite. Each produces
+multiple pull requests must land together, and a post-merge fuller suite. Each produces
 a manifest, runs an appropriate depth of tests, and may record the tuple if successful.
 
+### Visual Overview
+```mermaid
+flowchart TB
+    subgraph COMP[Component Repos]
+        pr[PR opened / updated<br><event>]:::event --> comp_ci[Component tests]:::step
+
+        trigger1[Merge to main<br><event>]:::event
+    end
+
+    subgraph INT[Integration Repo]
+        comp_ci --> |dispatch|detect_changeset[Detect multi repository PRs]:::step
+        knownGood[(Known good store)]:::artifact
+
+        %% PR
+        detect_changeset --> buildMan[Build PR/PRs manifest using PR/PRs SHA + known good others]:::step
+        knownGood --> buildMan
+        buildMan --> runSubset[Run fast subset of integration tests]:::step
+        runSubset --> prFeedback[Provide Feedback in PR / all PRs]:::step
+
+        %% Post-merge / scheduled full suite
+        trigger1 -->|dispatch| fullMan[Build full manifest from latest mains of all repos]:::step
+        trigger2[schedule<br><event>]:::event --> fullMan
+        fullMan --> fullSuite[Run full integration test suite]:::step
+        fullSuite --> fullPass{Full suite pass?}:::decision
+        fullPass -->|Yes| knownGood
+        fullPass -->|No| issue["Issue with manifest<br>(or a more clever automated bisect solution)"]:::status
+    end
+
+    classDef event fill:#E3F2FD,stroke:#1565C0,stroke-width:1px;
+    classDef step fill:#FFFFFF,stroke:#607D8B,stroke-width:1px;
+    classDef decision fill:#FFF8E1,stroke:#F9A825,stroke-width:1px;
+    classDef artifact fill:#E8F5E9,stroke:#2E7D32,stroke-width:1px;
+    classDef status fill:#FFEBEE,stroke:#C62828,stroke-width:1px;
+    classDef legend fill:#F5F5F5,stroke:#BDBDBD,stroke-dasharray: 3 3;
+
+    linkStyle default stroke:#888,stroke-width:1.2px;
+    class rec,knownGood artifact
+```
+*High-level flow of integration workflows. Known good store feeds manifest construction for single and coordinated paths; full test suite success updates the store.*
+
 ### Single Pull Request
 When a pull request opens or updates, its repository runs its normal fast tests. The
 integration repository is also triggered with the repository name, pull request number,
@@ -88,12 +128,12 @@ informal merge ordering as a coordination mechanism.
 
 ### Post-Merge Full Suite
 After merges we run a deeper suite. Some teams trigger on every push to main; others run
-on a schedule (for example hourly). Per-merge runs localise failures but cost more;
-batched runs save resources but expand the search space when problems appear (for
-example every two hours when resources are constrained). When the suite fails, retaining
-the manifest lets you bisect between the last known-good tuple and the current manifest
-(using a scripted search across the changed SHAs if multiple components advanced). On
-success we append a record for the tuple with a manifest hash and timing data.
+on a schedule (hourly seems to be a common practice). Per-merge runs localise failures
+but cost more; batched runs save resources but expand the search space when problems
+appear. When the suite fails, retaining the manifest lets you bisect between the last
+known-good tuple and the current manifest (using a scripted search across the changed
+SHAs if multiple components advanced). On success we append a record for the tuple with
+a manifest hash and timing data.
 
 ### Manifests
 Manifests are minimal documents describing the composition. They allow reconstruction of

From 75deb7dba788e98686674f95f31eb28c33039e44 Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 23:10:37 +0200
Subject: [PATCH 14/16] mermaid in dark mode attempt

---
 integration.md | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git a/integration.md b/integration.md
index f741918c..a1d3d915 100644
--- a/integration.md
+++ b/integration.md
@@ -90,18 +90,14 @@ flowchart TB
         fullMan --> fullSuite[Run full integration test suite]:::step
         fullSuite --> fullPass{Full suite pass?}:::decision
         fullPass -->|Yes| knownGood
-        fullPass -->|No| issue["Issue with manifest<br>(or a more clever automated bisect solution)"]:::status
+        fullPass -->|No| issue["Issue with manifest<br>(or a more clever automated bisect solution)"]:::red
     end
 
-    classDef event fill:#E3F2FD,stroke:#1565C0,stroke-width:1px;
-    classDef step fill:#FFFFFF,stroke:#607D8B,stroke-width:1px;
-    classDef decision fill:#FFF8E1,stroke:#F9A825,stroke-width:1px;
-    classDef artifact fill:#E8F5E9,stroke:#2E7D32,stroke-width:1px;
-    classDef status fill:#FFEBEE,stroke:#C62828,stroke-width:1px;
-    classDef legend fill:#F5F5F5,stroke:#BDBDBD,stroke-dasharray: 3 3;
-
-    linkStyle default stroke:#888,stroke-width:1.2px;
-    class rec,knownGood artifact
+    classDef event fill:#E3F2FD
+    classDef step fill:#FFFFFF
+    classDef decision fill:#FFF8E1
+    classDef artifact fill:#E8F5E9
+    classDef red fill:#FFEBEE
 ```
 *High-level flow of integration workflows. Known good store feeds manifest construction for single and coordinated paths; full test suite success updates the store.*

From 249b80b6f27dcc97a43a65f23adabfb2d9c35837 Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 23:12:58 +0200
Subject: [PATCH 15/16] mermaid: use default style

---
 integration.md | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/integration.md b/integration.md
index a1d3d915..78ebb82d 100644
--- a/integration.md
+++ b/integration.md
@@ -92,12 +92,6 @@ flowchart TB
         fullPass -->|Yes| knownGood
         fullPass -->|No| issue["Issue with manifest<br>(or a more clever automated bisect solution)"]:::red
     end
-
-    classDef event fill:#E3F2FD
-    classDef step fill:#FFFFFF
-    classDef decision fill:#FFF8E1
-    classDef artifact fill:#E8F5E9
-    classDef red fill:#FFEBEE
 ```
 *High-level flow of integration workflows. Known good store feeds manifest construction for single and coordinated paths; full test suite success updates the store.*

From 27b3f3308a14ff732bce257d24c9b91f1d49a764 Mon Sep 17 00:00:00 2001
From: Alexander Lanin
Date: Wed, 13 Aug 2025 23:13:47 +0200
Subject: [PATCH 16/16] mermaid: fix wording

---
 integration.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/integration.md b/integration.md
index 78ebb82d..ad81d26e 100644
--- a/integration.md
+++ b/integration.md
@@ -90,7 +90,7 @@ flowchart TB
         fullMan --> fullSuite[Run full integration test suite]:::step
         fullSuite --> fullPass{Full suite pass?}:::decision
         fullPass -->|Yes| knownGood
-        fullPass -->|No| issue["Issue with manifest<br>(or a more clever automated bisect solution)"]:::red
+        fullPass -->|No| issue["Create Issue<br>(or a more clever automated bisect solution)"]:::red
     end
 ```
 *High-level flow of integration workflows. Known good store feeds manifest construction for single and coordinated paths; full test suite success updates the store.*
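
---

The workflows in this series invoke helper scripts (`gen_pr_manifest.py`, `render_overrides.py`, `update_last_known_good.py`) whose contents are not part of the patches. Purely as an illustrative sketch of the idea — not the project's actual implementation — `render_overrides.py` could turn a manifest like the examples above into Bazel `git_override` stanzas; the manifest keys and output format here are assumptions:

```python
# Hypothetical sketch of a render_overrides.py-style helper.
# Assumption: the manifest has "components_under_test"/"others" lists, each entry
# with "name", "repo", and "ref", as in the example manifests above.

def render_overrides(manifest: dict) -> str:
    """Render git_override stanzas (bzlmod) pinning every component to its ref."""
    components = manifest.get("components_under_test", []) + manifest.get("others", [])
    stanzas = []
    for comp in components:
        stanzas.append(
            "git_override(\n"
            f'    module_name = "{comp["name"]}",\n'
            f'    remote = "https://github.com/{comp["repo"]}.git",\n'
            f'    commit = "{comp["ref"]}",\n'
            ")"
        )
    return "\n\n".join(stanzas) + "\n"


if __name__ == "__main__":
    # Example mirroring the single-pull-request manifest shape above.
    manifest = {
        "components_under_test": [
            {"name": "users-service", "repo": "eclipse-score/users-service", "ref": "a57hrdfg"},
        ],
        "others": [
            {"name": "component-a", "repo": "eclipse-score/component-a", "ref": "34985hf8"},
        ],
    }
    print(render_overrides(manifest))
```

Piping this output into an ephemeral `MODULE.override.bzl` (as the workflow does) keeps the pinned composition generated rather than hand-edited, which is exactly the "generate overrides; do not hand-edit ephemeral files" practice noted above.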