I've been investigating improving generators by adding grammar introspection to the generator traits. This allows an abstract tree to be constructed so the generator can be reasoned about at runtime.
This enables some interesting applications:
State space estimates
Having the whole grammar means we can produce an estimate of the size of the entire state space. This could then be used in a few ways.
The first would be to simply report the size. You could take this a step further and measure roughly how long each iteration takes, then extrapolate how long it would take to brute force the space.
Another interesting application would be to update the strategy used for smaller state spaces. For example: if the space is small enough and the test is fast enough, you could just exhaustively iterate through all of the inputs and declare it completely tested.
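As a rough sketch of what this could look like (none of these types or methods exist in bolero today; the Grammar enum and its cardinalities are assumptions for illustration only), the estimate falls out of a simple walk over the grammar tree, and the result can drive the choice between an exhaustive and a random strategy:

```rust
// Hypothetical sketch only: a grammar node that can report the cardinality
// of its value space. These types are illustrative, not bolero's actual API.
enum Grammar {
    // A terminal with a known number of possible values (e.g. bool = 2)
    Leaf { cardinality: u128 },
    // A struct/tuple: the product of its fields' spaces
    Product(Vec<Grammar>),
    // An enum: the sum of its variants' spaces
    Sum(Vec<Grammar>),
}

impl Grammar {
    /// Estimated number of distinct values this grammar can produce.
    /// Saturates instead of overflowing for very large spaces.
    fn state_space(&self) -> u128 {
        match self {
            Grammar::Leaf { cardinality } => *cardinality,
            Grammar::Product(fields) => fields
                .iter()
                .fold(1u128, |acc, f| acc.saturating_mul(f.state_space())),
            Grammar::Sum(variants) => variants
                .iter()
                .fold(0u128, |acc, v| acc.saturating_add(v.state_space())),
        }
    }
}

fn main() {
    // (bool, u8) has 2 * 256 = 512 possible values
    let grammar = Grammar::Product(vec![
        Grammar::Leaf { cardinality: 2 },
        Grammar::Leaf { cardinality: 256 },
    ]);

    let space = grammar.state_space();
    let ns_per_iteration = 500u128; // measured from a few sample runs

    println!("state space: {space} inputs");
    println!(
        "brute force estimate: ~{} us",
        space.saturating_mul(ns_per_iteration) / 1_000
    );

    // Small enough (and fast enough) to just enumerate every input
    if space <= 1u128 << 20 {
        println!("strategy: exhaustive");
    } else {
        println!("strategy: random");
    }
}
```

Saturating arithmetic keeps very large spaces from overflowing while still signalling "too big to enumerate".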
Topological fuzzing
I've also been thinking about using the grammar to improve how we pull bytes from the fuzzing input. Some of this is inspired by https://blog.regehr.org/archives/1700.
At a high level, say you have an input like the following:
```rust
enum Expr {
    Bool(bool),
    Not(Box<Expr>),
}
```
Currently, the decision as to which variant to use at each level is made in isolation. It looks something like this:
```rust
match driver.gen(0..2) {
    0 => Expr::Bool(driver.gen()),
    1 => Expr::Not(driver.gen()),
}
```
This causes the topology selection to be biased towards smaller inputs, rather than being uniform across all of the topologies the grammar can produce.
This is explained really well in that blog post:
Let’s consider this very simple decision tree:
If we use it to generate output a bunch of times, we’ll end up getting outputs with percentages something like this:
A 49.979
B 25.0018
C 12.509
D 6.26272
E 3.1265
F 3.12106
What this means in practice is we spend a large amount of time generating shallow inputs we've already seen, rather than exploring deeper trees. This becomes even more of an issue once you start using Vecs in the input.
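Just to make the fall-off concrete for the Expr example above (this is only an illustration of the bias, not part of the proposal): with an unweighted 50/50 choice at each level, the probability of stopping at depth d is 0.5^(d+1), which is exactly the halving pattern in the percentages quoted above.

```rust
// Illustration only: with an unweighted coin flip at each level of Expr,
// the chance of producing a tree of a given depth falls off geometrically.
fn main() {
    for depth in 0..6u32 {
        // P(depth) = 0.5^(depth + 1): every level keeps recursing with probability 0.5
        let p = 0.5f64.powi(depth as i32 + 1) * 100.0;
        println!("depth {depth}: {p:.3}%");
    }
}
```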
This can be solved by inspecting the grammar at the beginning of a fuzzing run and producing weights for each variant based on the number of topologies reachable through its subtree. This will produce uniformly distributed shapes.
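Here's a minimal sketch of how those weights could be computed for Expr (the depth cap and the function names are my own assumptions, not the prototype's actual implementation): count the topologies each variant can still produce with the remaining depth budget, then select variants in proportion to those counts.

```rust
// Sketch only: derive selection weights from topology counts for
//     enum Expr { Bool(bool), Not(Box<Expr>) }
// rather than flipping an unweighted coin at each level.

/// Number of distinct Expr tree shapes with at most `depth` remaining levels.
fn count_topologies(depth: u32) -> u64 {
    if depth == 0 {
        // Only the non-recursive variant (Bool) fits in zero remaining levels
        return 1;
    }
    // Bool terminates here, Not wraps any topology that is one level shorter
    1 + count_topologies(depth - 1)
}

/// Selection weights for (Bool, Not) given the remaining depth budget.
fn variant_weights(depth: u32) -> (u64, u64) {
    let bool_weight = 1;
    let not_weight = if depth == 0 { 0 } else { count_topologies(depth - 1) };
    (bool_weight, not_weight)
}

fn main() {
    let max_depth = 5;
    for d in (0..=max_depth).rev() {
        let (bool_w, not_w) = variant_weights(d);
        let total = bool_w + not_w;
        // Choosing Bool with probability bool_w/total and Not with not_w/total
        // at every level makes each of the `total` topologies equally likely.
        println!(
            "depth budget {d}: P(Bool) = {bool_w}/{total}, P(Not) = {not_w}/{total}"
        );
    }
}
```

At the root with a depth budget of 5 this gives Bool a 1/6 chance and Not a 5/6 chance, so all six shapes come out with equal probability instead of the shallow ones dominating.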
A really nice property of this is that we can map these topologies to binary representations. In this way, each topology is assigned a unique identifier. This should work quite well with fuzzers: we read the identifier at the beginning of the input, followed by a fixed separator, followed by a fixed pool of values for the nodes in the topology.
```rust
let topology = slice.split(4); // in this example the topology space fits in 4 bytes
if slice.split(4) != "Blro" {
    return;
}
match topology {
    0 => {
        // generate the values that are needed for this topology
    }
    // Others go here
}
```
Note: the above is pseudocode that highlights the concept, not necessarily how it will be implemented.
Something I want to highlight here: this means the generator branches once at the very beginning of the input, and every other value extracted after that will be fixed. I think this should be quite powerful, since without it, random changes at random offsets completely change how bytes are interpreted, which can be quite difficult for a fuzzing engine to track.
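As a purely illustrative sketch of that single up-front branch (the decoding scheme here is made up for Expr and isn't what the prototype does): the identifier read from the start of the input fully determines the tree shape, and the bytes after it only fill in leaf values, so their meaning stays stable as the fuzzer mutates them.

```rust
// Illustrative only: decode a topology identifier into a fixed Expr shape.
// The identifier picks the whole tree shape up front; the remaining bytes
// only fill in leaf values, so their interpretation never shifts.

#[derive(Debug)]
enum Expr {
    Bool(bool),
    Not(Box<Expr>),
}

/// For this grammar, topology `id` simply means "wrap a Bool in `id` Nots".
fn build(id: u8, values: &mut impl Iterator<Item = u8>) -> Expr {
    let leaf = Expr::Bool((values.next().unwrap_or(0) & 1) == 1);
    (0..id).fold(leaf, |inner, _| Expr::Not(Box::new(inner)))
}

fn main() {
    // First byte: topology identifier; the rest: a fixed pool of leaf values.
    let input: &[u8] = &[3, 0xff];
    let (id, values) = input.split_first().unwrap();
    let expr = build(*id, &mut values.iter().copied());
    println!("{expr:?}"); // Not(Not(Not(Bool(true))))
}
```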
This looks interesting! I haven't had the time to look through all the code, but it seems to have the potential to be the future of fuzzer input generation :)
Hope it works out well in practice, compared to other ways of generating the input!
@camshaft Is it possible to share more details about your proposed algorithm? Looking at the code is not very illuminating to me. :(
Some prior work on "k-paths" proposes an on-line (feedback-directed) algorithm for generating tests that cover portions of a grammar (in the case of Bolero, an enum type definition). Another option would be to analyze the grammar/enum and choose probabilities for different branches based on the hoped-for input depth and diversity. It seems like you are angling for this option. But I can't tell how it would work, generally, based on your example, since you don't show the connection to the type definition (at least, it doesn't seem like the topology part and the Expr part are related, perhaps I'm missing something).
I've implemented a WIP prototype of these concepts in https://github.com/camshaft/bolero/tree/grammars.