Revise some of the docs, add information about upgrading (#876)

jhump · web-flow · commit 16c2d4856064 · 2024-05-28T23:28:52.000Z
I was reviewing some of the docs and noticed a few things that could be
improved.

The changes to the bottom section of the README stem from
thinking about other kinds of changes that we may reasonably need to make
from time to time that could make it harder for a user to upgrade
to a newer version of the suite. I figured being very explicit about
this in the README is best for clarity and transparency.

This also adds a section to the README that describes how the suite
works, including a sequence diagram that shows what's going on
inside the test runner.
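
For reviewers who want a quick mental model before reading the diagram, here is a minimal Python sketch of the grouping step the new README section describes. It is not the runner's actual code: `ServerConfig`, `Case`, `group_by_server_config`, and the test case names are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class ServerConfig(NamedTuple):
    # Hypothetical fields, loosely modeled on the server configs that the
    # runner prints in verbose mode.
    http_version: str
    protocol: str
    use_tls: bool


class Case(NamedTuple):
    # Hypothetical stand-in for a test case; the names below are made up.
    name: str
    config: ServerConfig


def group_by_server_config(cases):
    """Bucket test cases so that each server config needs one server process."""
    groups = defaultdict(list)
    for case in cases:
        groups[case.config].append(case)
    return dict(groups)


cases = [
    Case("unary/success", ServerConfig("HTTP_VERSION_1", "PROTOCOL_CONNECT", True)),
    Case("unary/error", ServerConfig("HTTP_VERSION_1", "PROTOCOL_CONNECT", True)),
    Case("unary/success", ServerConfig("HTTP_VERSION_2", "PROTOCOL_GRPC", False)),
]

groups = group_by_server_config(cases)
for config, group in groups.items():
    # In the flow described in the README, this is where a server would be
    # started, each case sent to the client, results assessed, and the
    # server terminated.
    print(f"{len(group)} test(s) for {config}")
```

Grouping first means one server process can serve every test case that shares its configuration, which is why the diagram nests the test case loop inside the server config loop.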

It also adds a section to the other docs that describes the process for
upgrading to a new release of the conformance suite.
diff --git a/README.md b/README.md
@@ -33,6 +33,81 @@ started with any of these tasks, you'll want to read one or more of these guides
 * [Testing Client Implementations](./docs/testing_clients.md)
 * [Authoring New Test Cases](./docs/authoring_test_cases.md)
 
+## How it works
+
+The tests are data-driven: all test cases are defined in YAML files in this repo. These files
+get embedded in the test runner so that the single self-contained executable contains all of
+the test case data.
+
+The test runner first processes your configuration and uses that to select which test cases
+are relevant. Even if a test case is known to fail, it will still be executed to make sure it
+is still failing (and to report it if the test case unexpectedly passes).
+
+It then groups all of the test cases by the server configuration needed. So test cases that
+will use TLS and the Connect protocol are in a different group from test cases that do _not_
+use TLS and use the gRPC protocol.
+
+It then begins running the tests.
+
+```mermaid
+sequenceDiagram
+    actor user
+    create participant test runner
+    user ->> test runner: run conformance suite
+
+    create participant client
+    test runner -->> client: start process
+
+    rect rgb(255,250,240)
+    loop for each server config
+        create participant server
+        test runner -->> server: start process
+        test runner ->> server: send config via stdin
+        server ->> test runner: send result via stdout
+
+        rect rgb(240,255,240)
+        loop for each test case
+            test runner ->>+ client: send RPC details via stdin
+            client ->>+ server: invoke RPC, send request(s)
+            server ->> server: process RPC
+            server ->>- client: send response(s)
+            client ->>- test runner: send RPC results via stdout
+            test runner ->> test runner: assess RPC results
+        end
+        end
+
+        destroy server
+        test runner --x server: terminate
+    end
+    end
+
+    destroy client
+    test runner --x client: terminate
+
+    destroy test runner
+    test runner ->> user: report results
+```
+
+It first starts a client process (either a client under test, if in client mode, or a
+reference client).
+
+For each server configuration, it starts a server process (either a server under test, if
+in server mode, or a reference server). It sends the server configuration details by writing
+them to the process's _stdin_. When the server is listening on the network and ready to
+accept RPCs, it sends the details to the test runner by writing to its _stdout_.
+
+For each test case that applies to this server configuration, it augments the test case
+data with the server's address, so the client will know how to reach it. It then sends
+the test case data to the client by writing it to the process's _stdin_. The client then
+invokes the RPC and reports the RPC results to the test runner by writing them to its
+_stdout_.
+
+The test runner decides whether the test case was successful or not by comparing the
+RPC results against expected results.
+
+After all tests have been run and all child processes stopped, it reports the
+results.
+
 ## Testing your implementation
 
 ### Setup
@@ -135,7 +210,7 @@ This will build the necessary binaries and run tests of the following implementa
     confirm interoperability with official gRPC implementations.
 
 Both of the above clients are tested against both the Connect reference server and the gRPC server.
-The servers are tested against the Connect reference client and the gRPC client. And since the gRPC
+Both servers are tested against the Connect reference client and the gRPC client. And since the gRPC
 client does not support gRPC-Web, the servers are also tested against the official gRPC-Web JS client.
 
 ## Status: Stable
@@ -149,7 +224,21 @@ formats, or the Protobuf messages used by clients and servers under test in the
 Note, however, that we reserve the right to rename, remove, or re-organize
 individual test cases, which may impact the "known failing" and "known flaky"
 configurations for an implementation under test. We will document these changes
-in the [release notes](https://github.com/connectrpc/conformance/releases).
+in the [release notes].
+
+We also intend to add new test cases from time to time, and these additions may
+occasionally necessitate updates to the Protobuf schemas (such as new
+request or response fields). The Protobuf changes will remain compatible, so
+your programs will continue to compile, but actually passing new/updated test
+cases may require updates to your program, to incorporate the new fields into
+the behavior of the client or server under test. These kinds of changes will
+also be documented in the release notes.
+
+New test cases in a release could also reveal previously undetected conformance
+issues, which may require fixes to the implementations you are testing. So while
+we aim for backwards compatibility and for making it easy to upgrade to new releases
+of the conformance suite, some releases may take some effort to adopt.
+(See [the docs][upgrading] for more details.)
 
 ## Ecosystem
 
@@ -197,3 +286,5 @@ Offered under the [Apache 2 license][license].
 [docs]: https://connectrpc.com
 [license]: https://github.com/connectrpc/conformance/blob/main/LICENSE
 [protobuf-es]: https://github.com/bufbuild/protobuf-es
+[release notes]: https://github.com/connectrpc/conformance/releases
+[upgrading]: ./docs/configuring_and_running_tests.md#upgrading
diff --git a/docs/configuring_and_running_tests.md b/docs/configuring_and_running_tests.md
@@ -153,7 +153,7 @@ A single case is defined by the following properties:
 
 A single set of features is expanded into one or more (usually many more) config cases.
 For example, if the features support HTTP 1.1 and HTTP/2, all three protocols, all
-stream types, identity and gzip encoding, and TLS, that results in 2*3*5*2*2 = 120
+stream types, identity and gzip encoding, and TLS, that results in 2×3×5×2×2 = 120
 combinations. Some of those combinations may not be valid (such as full-duplex
 bidirectional streams over HTTP 1.1, or gRPC over HTTP 1.1), so the total number of
 config cases would be close to 120 but not quite.
@@ -208,9 +208,9 @@ Let's dissect this line-by-line:
   anything that _looks_ like an option, is actually a positional argument.
 * `./path/to/client/program --some-flag-for-client-program`: The positional arguments
   represent the command to invoke in order to run the client under test. The first
-  token must be the path to the executable. Any other arguments are passed to that
-  executable as arguments. So in this case, `--some-flag-for-client-program` is an
-  option that our client under test understands.
+  token must be the path to the executable. Any subsequent arguments are passed as
+  arguments to that executable. So in this case, `--some-flag-for-client-program` is
+  an option that our client under test understands.
 
 Common reasons to pass arguments to the client or server under test are:
 1. To control verbosity of log output. When troubleshooting an implementation, it
@@ -331,9 +331,9 @@ If you provide a `-v` option to the test runner, it will print some other messag
 running:
 ```text
 Computed 44 config case permutations.
-Loaded 1 known failing test cases/patterns.
-Loaded 8 test suites, 97 test case templates.
-Computed 602 test case permutations across 10 server configurations.
+Loaded 8 test suite(s), 97 test case template(s).
+Loaded 1 known failing test case pattern(s) that match 4 test case permutation(s).
+Computed 602 test case permutation(s) across 10 server configuration(s).
 Running 47 tests with reference server for server config {HTTP_VERSION_1, PROTOCOL_CONNECT, TLS:false}...
 Running 47 tests with reference server for server config {HTTP_VERSION_1, PROTOCOL_CONNECT, TLS:true}...
 Running 46 tests with reference server for server config {HTTP_VERSION_1, PROTOCOL_GRPC_WEB, TLS:false}...
@@ -349,11 +349,12 @@ Running 46 tests with reference server (grpc) for server config {HTTP_VERSION_2,
 Running 46 tests with reference server (grpc) for server config {HTTP_VERSION_2, PROTOCOL_GRPC_WEB, TLS:false}...
 ```
 This shows a summary of the config as it is loaded and processed, telling us the total number of
-[config cases](#config-cases) that apply to the current configuration (44) and the number of patterns that
-identify "known failing" cases. It shows us the total number of test suites (8) and the total number of test
-cases across those suites (97). The next line shows us that it has used the 44 relevant config cases and 97
-test case templates to compute a total of 602 [test case permutations](#test-case-permutations). This means
-that the client under test will be invoking 602 RPCs.
+[config cases](#config-cases) that apply to the current configuration (44), the total number of test suites (8),
+and the total number of test cases across those suites (97). It then shows the number of patterns
+provided to identify "known failing" cases (1), and the number of test cases that matched the "known
+failing" patterns (4). The next line shows us that it has used the 44 relevant config cases and 97
+test case templates to compute a total of 602 [test case permutations](#test-case-permutations). This
+means that the client under test will be invoking 602 RPCs.
 
 The remaining lines in the example output above are printed as each test server is started. Each server config
 represents a different RPC server, started with the given configuration (since we are running the tests using
@@ -447,6 +448,11 @@ flaky test cases, use `--known-flaky` (instead of `--skip`). Use of `--run` or `
 configurations is discouraged. It should instead be possible to correctly filter the set of tests
 to run just based on config YAML files.
 
+One reason you might need to use `--skip` in a CI configuration is if a bug in the implementation
+under test causes the client or server to crash or to deadlock. Since such bugs could prevent the
+conformance suite from ever completing successfully (even if such tests are marked as "known
+failing"), it may be necessary to temporarily skip them in CI until those bugs are fixed.
+
 ## Configuring CI
 
 The easiest way to run conformance tests as part of CI is to do so from a container that has the
@@ -500,6 +506,40 @@ If you have multiple test programs, such as both a client and a server, or even
 different sets of arguments, you should name the relevant config YAML and known-failing files so
 it is clear to which invocation they apply.
 
+## Upgrading
+
+When a new version of the conformance suite is released, ideally you can simply update
+the version number you are using and everything just works. We aim for
+backwards-compatibility between releases to maximize the chances of this ideal outcome. But
+there are a number of things that can happen in a release that make the process a little
+more laborious:
+
+* As a matter of hygiene/maintenance, we may rename and re-organize test suites and test
+  cases. This means that any test case patterns that are part of your configuration (like
+  known-failing files) may need to be updated. We don't expect this to happen often, but
+  when it does, we will include information in the release notes to aid in updating your
+  configuration.
+* The new version may contain new/updated test cases that require some changes in the
+  behavior/logic of your implementations under test. This might be for testing new
+  functionality that requires new fields in the conformance protocol messages. Without
+  changes in your client or server under test, the new test cases will likely fail.
+* The new version may contain new/updated test cases that reveal previously undetected
+  conformance failures.
+
+To minimize disruption when upgrading, we recommend the following process:
+1. Update to the new release of the conformance suite.
+2. Update test case patterns (like in known-failing configurations) if necessary to match
+   any changes to test case names and organization.
+3. Update/add known-failing configurations for any new failures resulting from new/updated
+   test cases.
+4. **Commit/merge the upgrade.**
+5. File bugs for the new failures.
+6. As the bugs are fixed, update the known-failing configurations as you go.
+
+Simply marking all new failures as "known failing" and filing bugs for them should
+allow you to upgrade to a new release quickly. You can then decide on the urgency of
+fixing the new failures and prioritize accordingly.
+
 [config-proto]: https://buf.build/connectrpc/conformance/docs/main:connectrpc.conformance.v1#connectrpc.conformance.v1.Config
 [configcase-proto]: https://buf.build/connectrpc/conformance/docs/main:connectrpc.conformance.v1#connectrpc.conformance.v1.ConfigCase
 [connect-protocol]: https://connectrpc.com/docs/protocol/
diff --git a/docs/testing_clients.md b/docs/testing_clients.md
@@ -199,9 +199,9 @@ ignored (and the headers and/or trailers treated as empty).
 ### Cancellation
 
 The `ClientCompatRequest` can contain instructions for the client program to cancel the
-RPC before it has completed. When the RPC is canceled based on these instructions, the
-rest of the invocation logic should proceed as if it had _not_ been canceled. This way,
-the client program exercises the client implementation's cancellation handling, and how
+RPC before it has completed. Ideally, when the RPC is canceled based on these instructions,
+the rest of the invocation logic should proceed as if it had _not_ been canceled. This way,
+the client program can exercise the client implementation's cancellation handling, and how
 it impacts subsequent operations for the call. This allows the conformance suite to verify
 that asynchronous cancellations are handled correctly by the implementation and result in
 proper notification of the cancellation to the code that is consuming the RPC results.
@@ -241,6 +241,7 @@ invoke the method using the given request headers,
    delay the indicated number of milliseconds   
    cancel the RPC (but do not return)
 }
+receive the response
 
 if the operation fails {
    abort, returning a result that describes the error and any
@@ -252,9 +253,9 @@ construct a result using the payload and any available headers
 ```
 
 _*_ Note: some client APIs will provide a blocking operation for unary RPCs,
-    which doesn't return until the RPC is complete. For these cases, you must
-    arrange for the RPC to be canceled asynchronously after the indicated
-    number of milliseconds, and then invoke the unary operation.
+    which doesn't return until the RPC response is received. For these cases,
+    you must arrange for the RPC to be canceled asynchronously after the indicated
+    number of milliseconds, and then invoke the blocking operation.
 
 #### Client stream