-
Notifications
You must be signed in to change notification settings - Fork 695
[YDB CLI][CSV] Optimization: Use Apache Arrow and Protobuf Arena API as fallback in CSV upload module #19667
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Vladislav0Art
wants to merge
35
commits into
ydb-platform:main
Choose a base branch
from
Vladislav0Art:vartiukhov/feature/ydb-cli/use-apache-arrow-and-arena-as-fallback
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+858
−225
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Because `TValue&& rows` is provided as rvalue reference, we can move its value into the request rather than making a copy of the data. Speed boost: ~130MiB/s -> ~168-200MiB/s.
…sertUnretryable (leaving BulkUpsert unchanged)
…na, 1-to-1 allocation)" This reverts commit b9bd7cd.
Implement additional APIs that work with arena-allocated proto messages. The implementation creates a new protobuf::Arena per batch request and builds Ydb::Value inside this arena inside TCsvParser::BuildListOnArena. Ydb::Value gets created once and then moved into Ydb::Table::BulkUpsertRequest, which is allocated on the same arena (allocation in the same arena prevents copying). Left undone/requires modifications: 1. TODO: I had to copy-paste RunDeferred/Run methods (see: RunDeferredOnArena/RunOnArena) because arena-allocated messages are returned as pointers, but the API expected rvalue/lvalue references on the request message. If the pointer is deferefenced, there will be a single copy of the message inside TGRpcConnectionsImpl::Run when WithServiceConnection is called and request is moved into capture parameter of a callback. In the best case, there must be no code duplication (the copy-pasted methods differ only in the type of accepted request: they accept TRequest*). 2. TODO: ensure correctness. Speed boost: ~130MiB/s -> ~250MiB/s
Notes: 1. Creates custom converter in `csv_parser.cpp` to build `Ydb::Value` on the arena. 2. Creates new implementer of `TValueBuilderBase` to build `TArenaAllocatedValue` in `value.cpp`/`value.h`. 3. Introduces `TValueHolder` interface with two implementations: - `StackAllocatedValueHolder` with stack-allocated `Ydb::Value`. - `TArenaAllocatedValueHolder` with arena-allocated `Ydb::Value`. Current drawbacks: - `TValueHolder` has virtual calls, which is costy; try to employ static polymorphism. - `FieldToValueOnArena` creates `TCsvToYdbOnArenaConverter`, which is a copy of `TCsvToYdbConverter`. Verdict: didn't produce any performance improvements (likely due to virtual calls in `TValueHolder`).
…ildListOnArena` This commit fixes the issue with virtual calls in `TValueHolder`, integrated in the commit 396d98f. Average speed: 254-257 MiB/s (33.1-33.6 sec spent) There seems to be no performance difference in terms of upload speed. The only difference is that the percental fraction of CPU load that `TCsvParser::BuildListOnArena` method takes: 1. this commit: 42.7% of the total CPU time. 2. commit 396d98f: 43.1% of the total CPU time. 3. without both commits: 49.3% of the total CPU time.
…hars counting Benchmarking results (see commit 6201814 for details): 1. Previous solution (at commit 6201814): Elapsed: 17.7 sec. Total read size: 8.33GiB. Average processing speed: 482MiB/s. 2. Current solution: Elapsed: 10.6 sec. Total read size: 8.33GiB. Average processing speed: 804MiB/s (up to 817MiB/s). Cherry-picked the commit: `773a3a5` See: 773a3a5
…Types with Protobuf Arena
… quote chars counting" This reverts commit d3961f2.
Requires testing via a real instance of YDBD to ensure that the refactoring is correct.
Hi! Thank you for contributing! |
🟢 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Changelog entry
Changelog category
Description for reviewers
Changed: YDB CLI, CSV uploading module.
Description:
Optimized the upload speed of the CSV uploading module in YBD CLI by migrating the implementation from use of
TValue
to sending data in the Apache Arrow format, with a fallback onTValue
solution whose protobuf models are allocated on the Protobuf Arena.The fallback is a "one arena per one request" strategy, i.e. a protobuf model of every request gets allocated on its own Protobuf arena.
Upload speed improvements:
The initial upload speed is about: 134.84 MiB/s (measured on commit 33732bf79ec)
The improvements:
Methodology used for measurements:
ydb import file csv
throughput #11678).Notes:
TLinesSplitter::ConsumeLine
method yields further improvement of the CSV upload speed of 476.49 MiB/s (+253%).import.cpp
here).Relates to: #11678
Responsible: @pnv1, @asmyasnikov