add lineage-driven fault injection #81

Open
emaxerrno opened this issue Mar 3, 2017 · 10 comments
Comments

@emaxerrno
Collaborator

https://people.eecs.berkeley.edu/~palvaro/molly.pdf

tl;dr: inject faults, and the outcome should be graceful recovery - or at least documented failures, so they are well understood.

i.e.: fault tolerance is a global property.


@emaxerrno
Collaborator Author

https://github.com/palvaro/molly ... and et voila

@dotnwat
Contributor

dotnwat commented Dec 13, 2017

oh, wow this is very interesting. i had started to design a dedalus plugin for one of our projects, and i've been using molly to run those examples.. it's all very early days :) https://github.com/noahdesu/zlog/blob/dedalus/qa/dedalus/zlog.v2.ded

@emaxerrno
Collaborator Author

I think we can add a new code generation facility to do 2 things:

  1. Generate an Oracle - which is basically a method proxy that does the accounting
  2. Add a system header that will either crash, drop the connection, throw exceptions, etc.

This is obviously not for scale, but for correctness.
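A minimal sketch of the two pieces above, written in Python for brevity rather than in smf's own code generator. All names here (`OracleProxy`, `InjectedFault`, `KvStore`, `fault_plan`) are hypothetical and only illustrate the idea: a proxy that counts every call (the accounting) and raises an injected fault on the calls a test plan selects.

```python
from collections import Counter


class InjectedFault(Exception):
    """Raised by the proxy in place of a real failure (hypothetical name)."""


class OracleProxy:
    """Wraps a service object, counts every call, and injects faults on request."""

    def __init__(self, target, fault_plan=None):
        self._target = target
        # fault_plan maps a method name to the 1-based call numbers that must fail.
        self._fault_plan = fault_plan or {}
        self.calls = Counter()     # accounting: attempts per method
        self.failures = Counter()  # accounting: injected faults per method

    def __getattr__(self, name):
        # Only reached for names not defined on the proxy itself.
        method = getattr(self._target, name)

        def wrapper(*args, **kwargs):
            self.calls[name] += 1
            if self.calls[name] in self._fault_plan.get(name, ()):
                self.failures[name] += 1
                raise InjectedFault(f"{name} call #{self.calls[name]}")
            return method(*args, **kwargs)

        return wrapper


class KvStore:
    """Stand-in for a generated service (assumption, not an smf type)."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value
        return True
```

With `OracleProxy(KvStore(), {"put": {2}})`, the second `put` raises `InjectedFault` and the counters record both the attempt and the failure, so a test can check that the caller recovered.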

That's my preliminary design.

Thoughts ?

@emaxerrno
Collaborator Author

I know that sean used a config property loaded by the actor system in wallaroo, which does this for their network nemesis.

I also know that seastar has a built in disk nemesis too.

@emaxerrno
Collaborator Author

also zlog looks awesome and you are further along than I am. The dedalus plugin looks very cool. I have not sat down and written any dedalus yet, but I can't wait to do it.

We should port zlog to smf! haha.

I know @hellertime was looking to build something similar to zlog - i pointed him to your repo.

@dotnwat
Contributor

dotnwat commented Dec 13, 2017

  • i'd also add dropping and duplicating packets. peter is a member of our research group, so i could set up a meeting with him at some point if there are questions. i'm sure he'd be very interested in anything LDFI related.

  • the intent is to bring in smf for some key components in zlog. the tentative plan is to (1) bring in smf to replace boost asio in our sequencer, then (2) for the Ceph-based storage backend, build a simple proxy layer that does request aggregation across clients, and (3) build out a proper spdk-based backend that fully replicates the CORFU protocol and eliminates Ceph as a required backend.

@emaxerrno
Collaborator Author

oh that's neat! i didn't know you were a researcher - (hadn't googled) - exciting times!

RE: LDFI - duplicating packets might be more difficult than duplicating messages - i am not entirely sure how to do it within the kernel, for example - maybe eBPF? For a DPDK-based runtime, maybe hacking core/net/tcp.hh in seastar would be the way to do packet duplication.

replicating messages is very very easy, since we own the protocol front to back.
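As a rough illustration of why this is easy at the application level, here is a small in-memory sketch in Python. `FaultyChannel` and its parameters are hypothetical names, not part of smf or seastar; the point is only that when the sender owns the framing, dropping or duplicating a message is a queue operation.

```python
import random
from collections import deque


class FaultyChannel:
    """In-memory channel that can drop or duplicate application-level messages."""

    def __init__(self, drop_rate=0.0, dup_rate=0.0, seed=0):
        self._rng = random.Random(seed)  # seeded so a failing run can be replayed
        self._drop_rate = drop_rate
        self._dup_rate = dup_rate
        self._queue = deque()

    def send(self, msg):
        if self._rng.random() < self._drop_rate:
            return  # message is lost
        self._queue.append(msg)
        if self._rng.random() < self._dup_rate:
            self._queue.append(msg)  # same message delivered twice

    def drain(self):
        """Return everything delivered so far and empty the channel."""
        out = list(self._queue)
        self._queue.clear()
        return out
```

A receiver under test would read from `drain()` and is expected to tolerate the duplicates, for example by ignoring message ids it has already applied.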

RE: Sequencer: back in the concord.io days we wanted to write a sequencer too for leasing token ranges and never got around to it :'( - i.e.: such that only one process could make writes to a particular (stream_topic, key, value) tuple.
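A minimal sketch of that kind of leasing, assuming a single in-process lease table and explicit timestamps; this is not how concord.io or zlog implement it, and `LeaseSequencer`, `acquire`, and `may_write` are hypothetical names.

```python
class LeaseSequencer:
    """Grants exclusive, time-limited write leases on keys (hypothetical design)."""

    def __init__(self, lease_seconds=10.0):
        self._lease_seconds = lease_seconds
        self._leases = {}  # key -> (owner, expiry time)

    def acquire(self, key, owner, now):
        """Take or renew the lease; refuse if another owner holds a live one."""
        holder = self._leases.get(key)
        if holder is not None:
            current_owner, expiry = holder
            if current_owner != owner and now < expiry:
                return False
        self._leases[key] = (owner, now + self._lease_seconds)
        return True

    def may_write(self, key, owner, now):
        """True only while `owner` holds an unexpired lease on `key`."""
        holder = self._leases.get(key)
        return holder is not None and holder[0] == owner and now < holder[1]
```

Passing `now` in explicitly keeps the sketch deterministic, which also makes it easy to drive from a fault-injection test that skews clocks.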

RE: SPDK.io - i know that avi was thinking about writing a seastar filesystem - probably through libfuse - that uses the same IO engine in seastar so that you can just deploy one app and there would be no need for anything else - of course it would be specialized for storage apps like queues and databases. I hope to test SPDK too but the 3dXpoint drives are so expensive.

One caveat you might run into here is that the IO engine and the queue measurement are global. That is, per-device queue measurement across multiple devices is not possible today with seastar, so you'd have to raid0 the drives and hope for the best.

I think Glauber Costa was working on a multi-device patch at some point.

@dotnwat
Contributor

dotnwat commented Dec 13, 2017

  • i meant application level messages when i said packets.
  • i'll have to pick your brain sometime about the sequencer. zlog is all about having a total global ordering, so we don't have any log sharding or topics the way kafka does.
  • those are some interesting limitations about the device queue management. i'll have to look more closely. i suspect that would be something people will want, probably before i get around to needing it!
  • On a related note, we use spdk on just some NVMe drives and can get up around 1M IOPS. I'm not sure how the cost compares to 3dxpoint.

@emaxerrno
Collaborator Author

RE: msg replay: this is easy then :)
RE: sequencer: sure thing.
RE: spdk: i talked to intel a few months ago, and the 3dXpoint drives were $1400 for consumers. Cloud providers get a big discount, but it is still a pretty steep price for the rest of us. I think the samsung 3D NAND drives are also around the same price (a bit cheaper), though i don't think they have added it yet to the spdk project.

I think seastar could hugely benefit from integrating spdk. I haven't looked too closely but something has to tick the loop, and seastar manages the loop for DPDK, so you'll need to add the SPDK loop there too.

emaxerrno added a commit that referenced this issue Dec 31, 2017
issue: #81
The idea is to have smf_gen autogen
an oracle that can do the Molly::LDFI accounting
@emaxerrno
Collaborator Author

emaxerrno commented Dec 31, 2017

@noahdesu i just added a rough outline, thoughts welcomed!

Basically, my idea is that something is better than nothing and even if we just provide the stubs that do:

  1. message replay
  2. reordering of timing and sequence
  3. crashes

It is still significant and useful, albeit it only covers request-response protocols.

On a later design revision we can address a) more complex protocols and sequencing, and b) running the hazard step from molly automatically by linking in a SAT solver too.
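To make the solver step concrete: as I understand the LDFI approach, the lineage of a successful run is reduced to a set of "supports" (each one a set of messages or nodes that was sufficient for the good outcome), and a fault hypothesis is any set of faults that breaks every support. The Python sketch below uses a brute-force search as a stand-in for the SAT solver; `fault_hypotheses`, its arguments, and the event names in the usage note are hypothetical.

```python
from itertools import combinations


def fault_hypotheses(supports, max_faults):
    """Return the minimal fault sets that break every support of a good outcome.

    supports: list of sets; each set names the events (messages, nodes) that
    one successful derivation of the outcome relied on.
    """
    events = sorted(set().union(*supports))
    found = []
    for size in range(1, max_faults + 1):
        for candidate in combinations(events, size):
            chosen = set(candidate)
            if any(prev <= chosen for prev in found):
                continue  # a smaller hypothesis already covers this one
            if all(chosen & support for support in supports):
                found.append(chosen)
    return found
```

For example, if the outcome was reached both via `{"a_to_b", "b_to_c"}` and via `{"a_to_c"}`, no single fault can prevent it, and with `max_faults=2` the search proposes dropping `a_to_c` together with either `a_to_b` or `b_to_c`. Each proposed set is what the injection stubs would then be asked to replay.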

thanks!
