Skip to content

Commit aeb27f0

Browse files
authored
artifact store: fix correctness bug; remove copy read timeout (#7866)
This fixes the two main problems identified in #7796: - In some situations a truncated artifact can be persisted to the artifact store. (yikes!) - Uploading a real TUF repo to an a4x2 cluster results in lots of read timeouts. In the previous implementation, we always wrote temporary files to `tmp/{sha256}`, and returned an error if that file was already present. Due to the Drop behavior of Utf8TempFile, what I think was my attempting to simplify the code resulted in deleting another task's temporary file when we return an "already in progress" error. (This PR adds a regression test for this issue, which I've verified fails as expected on the current implementation.) This replaces the temporary file persisting logic with the atomicwrites crate as suggested by @sunshowers, which is resistant to trivial mistakes like this. AtomicFile::write takes a function which must perform the entire write operation; if the function returns an error, the file is not renamed to the final path. To make it work in an asynchronous context, AtomicFile::write is placed on a blocking task and bytes are sent over an mpsc channel. The write task also creates a checksum of the data, returning an error to prevent persisting the file if the checksum is invalid. atomicwrites also syncs the file as well as the parent directories, meaning we can remove that code in our implementation. Since the logic for detecting multiple in-progress transfers relied on the predictable naming of temporary files, I removed that. [The original reasoning for checking this][1] was to avoid doing unnecessary work. The current implementation allows writing an artifact that already exists though, so long as someone else isn't trying to write it at the same time. (I intentionally allowed overwriting an existing artifact to allow Nexus to work around an incorrectly-written artifact in the store, if it noticed such a thing happening; it doesn't currently.) I don't know which is better; in theory allowing multiple writers means that if one fails spuriously the other that's running could still succeed. It would be simple enough to add the proposed change in #7860 to this PR as well. The reason the 15 second read timeout was in place for copy requests was to avoid it entirely holding up any other attempts to replicate the artifact. Since we would no longer stop these attempts the read timeout can be removed. [1]: a5052d0#r1795902195
1 parent a5dd6d1 commit aeb27f0

File tree

3 files changed

+230
-181
lines changed

3 files changed

+230
-181
lines changed

Cargo.lock

Lines changed: 1 addition & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

sled-agent/Cargo.toml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@ workspace = true
1111
[dependencies]
1212
anyhow.workspace = true
1313
async-trait.workspace = true
14+
atomicwrites.workspace = true
1415
base64.workspace = true
1516
bootstore.workspace = true
1617
bootstrap-agent-api.workspace = true

0 commit comments

Comments
 (0)