Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NEP-568: Resharding V3 #568

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

NEP-568: Resharding V3 #568

wants to merge 30 commits into from

Conversation

wacban
Copy link
Contributor

@wacban wacban commented Oct 24, 2024

NEP Status (Updated by NEP Moderators)

Status: Review

SME reviews:

@flmel flmel added the S-draft/needs-author-revision A NEP in the DRAFT stage that needs an author revision. label Oct 24, 2024
@flmel flmel changed the title NEP-XXX: Resharding v3 NEP-568: Resharding v3 Oct 24, 2024
@flmel
Copy link
Member

flmel commented Oct 28, 2024

Hi @wacban – thank you for starting this proposal. As the moderator, I labeled this PR as "Needs author revision" because we assume you are still working on it since you submitted it in "Draft" mode.
Please ping the @near/nep-moderators once you are ready for us to review it

Filling in the future possibilities section - and looking for more ideas :)
Filling the specification section about flat state.
@render render bot temporarily deployed to resharding - nomicon PR #568 November 6, 2024 15:43 Destroyed
This is complete except for section `Handling buffered receipts that
target parent shard` which is still being discussed.
neps/nep-0568.md Show resolved Hide resolved
neps/nep-0568.md Show resolved Hide resolved
Some beautification courtesy ChatGPT.

I double checked everything, we aren't changing any meaning.
@shreyan-gupta shreyan-gupta changed the title NEP-568: Resharding v3 NEP-568: Resharding V3 Dec 16, 2024
@walnut-the-cat walnut-the-cat added S-review/needs-sme-review A NEP in the REVIEW stage is waiting for Subject Matter Expert review. and removed S-draft/needs-author-revision A NEP in the DRAFT stage that needs an author revision. labels Dec 16, 2024
@walnut-the-cat
Copy link
Contributor

As the moderator,

I believe this proposal is ready for SME review.

@near/wg-protocol , could you help assign SMEs who can review the proposal?

Thank you.

@shreyan-gupta shreyan-gupta marked this pull request as ready for review December 17, 2024 04:59
@shreyan-gupta shreyan-gupta requested a review from a team as a code owner December 17, 2024 04:59
@flmel flmel added the WG-protocol Protocol Standards Work Group should be accountable label Dec 18, 2024
@PiVortex PiVortex mentioned this pull request Dec 23, 2024
9 tasks
@bowenwang1996
Copy link
Collaborator

As a working group member, I nominate @birchmd and @mfornet as SME reviewers.

Copy link
Contributor

@birchmd birchmd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a reviewer I would like to thank the team for putting together such a well-written and detailed proposal! Designing this feature was no easy task, and the proposal here seems to address all the intricacies nicely.

I have left two questions, but otherwise the design looks good to me.

neps/nep-0568.md Show resolved Hide resolved

#### Flat State's Status Persistence

Every shard's Flat State has a status associated with it and stored in the database, called `FlatStorageStatus`. We propose extending the existing object by adding a new enum variant named `FlatStorageStatus::Resharding`. This approach has two benefits. First, the progress of any Flat State resharding is persisted to disk, making the operation resilient to a node crash or restart. Second, resuming resharding on node restart shares the same code path as Flat State creation (see `FlatStorageShardCreator`), reducing code duplication.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the node crashes during resharding then the HybridArena MemTrie objects also need to be reconstructed to resume block processing. How does this recovery happen? I assume it is easy to reconstruct the FrozenArena from the Flat state, but what about the changes on top of it? Are they persisted somewhere? If they are not then I suppose the node would need to recreate the MemTrie by re-applying the chunks that have happened so far in the epoch.

Copy link
Contributor Author

@wacban wacban Jan 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Trisfald Can you reply please?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the HybridArena Memtrie will need to be reconstructed upon crash or even a simple restart.

I think at the moment stage of nearcore/master we don't support seamless node restarts yet. There are a couple possible solution to this problem since all changes to the state are always persisted, in a way or another. In increasing order of complexity:

  • Wait for flat storage to complete resharding and load the memtrie for children shards as usual
  • Recreate the memtrie of the parent shard as it was at the resharding block, split it again, and apply changes
  • Create memtries from current flat storage state of children (even if not complete) and apply remaining changes
  • More complex solutions which leverage decoupling of memtrie from flat storage

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification @Trisfald . I brought this up here because crash recovery is mentioned as part of the motivation for the design. Maybe it is worth adding a section to the MemTrie specification about how restarts during resharding will be handled.

@walnut-the-cat
Copy link
Contributor

@mfornet , could you be able to review this NEP?

@walnut-the-cat
Copy link
Contributor

@mfornet , another ping to make sure it catches your eyes. As we are trying to release resharding v3 with upcoming release, it will be great if you could share your thoughts.

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-NEP A NEAR Enhancement Proposal (NEP). S-review/needs-sme-review A NEP in the REVIEW stage is waiting for Subject Matter Expert review. WG-protocol Protocol Standards Work Group should be accountable
Projects
Status: In Review 👀
Status: REVIEW
Development

Successfully merging this pull request may close these issues.

10 participants