Add and use BytestringProvider in fuzz_one_input (#4221)

Conversation
Overall looks good, but I'd be inclined to make it simpler where we can - for example, just drawing zero or one bits for any boolean. More generally we probably want to abort rather than retry whenever we reasonably can for the fuzzing use-case, but maybe we leave that out of BytestringProvider and instead construct a special-purpose AtherisProvider in a future PR.

(I'd also like to generate a starter corpus for Atheris using the HypothesisBackend, but atheris.FuzzedDataProvider is in C++ and a moderate effort to invert, ugh)
I'm hesitant to ignore boolean weights because it makes creating long collections (and preserving them in mutations, by not truncating via an early boolean -> False) unlikely. Though now that you bring it up, respecting integer weights is probably net-negative due to changing draw size. I agree that aborting rather than rejection-sampling out-of-bounds ints is a good idea, though I'm 50/50 on whether that should live here or in an atheris provider.
If the suggested workflow is running hypothesis to generate a corpus and then separately running pure atheris with that corpus, then I don't think inverting the FDP is possible, because the user may write any of many possible atheris strategies by combining …
Yeah, I could deal with drawing up to one byte for use as a boolean, but would probably want to resolve this via some kind of mini-benchmark. Conceptually we're not inverting the FDP, only our provider - it'd look like …
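(For illustration of "inverting only our provider": one could replay a test under the normal Hypothesis backend, record the typed choices it makes, and re-encode them into whatever byte format the bytestring provider consumes. A minimal sketch with an invented encoding - every name and byte layout below is made up for the example, not the PR's actual format:)

    import struct

    def encode_choice(choice) -> bytes:
        # Hypothetical fixed-width encoding, NOT the real BytestringProvider format.
        if isinstance(choice, bool):  # check bool before int: bool subclasses int
            return b"\x01" if choice else b"\x00"
        if isinstance(choice, int):
            return choice.to_bytes(8, "big", signed=True)  # fixed 8 bytes for simplicity
        if isinstance(choice, float):
            return struct.pack("!d", choice)
        if isinstance(choice, str):
            data = choice.encode("utf-8")
            return len(data).to_bytes(2, "big") + data
        if isinstance(choice, bytes):
            return len(choice).to_bytes(2, "big") + choice
        raise NotImplementedError(f"unhandled choice type: {type(choice)}")

    def corpus_entry(choices) -> bytes:
        """Concatenate one recorded choice sequence into a starter-corpus file."""
        return b"".join(encode_choice(c) for c in choices)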
The results for lists are more or less what you'd expect - by default our p_continue is around 0.83, so the distribution falls out as standard geometric with either p=0.83 (drawing up to 8 bits) or p=0.5 (drawing only one bit), with the latter generating empty lists about half the time. I expect this to generalize to mutation, since any mutation on a p=0.5 boolean has a 50% chance to truncate the list at that position.

    import random
    from collections import Counter

    from hypothesis import given, strategies as st

    c = Counter()

    @given(st.lists(st.integers()))
    def f(l):
        c[len(l)] += 1

    # feed 100 random 1000-byte buffers through fuzz_one_input and tally list lengths
    for _ in range(100):
        b = random.randbytes(1000)
        f.hypothesis.fuzz_one_input(b)

    print(c)

    # current
    Counter({2: 15, 0: 11, 3: 9, 1: 9, 5: 8, 4: 7, 8: 6, 13: 6, 7: 5, 11: 4, 6: 3, 12: 2, 20: 2, 14: 2, 18: 2, 17: 2, 15: 2, 21: 1, 25: 1, 10: 1, 22: 1, 9: 1})
    # return self._draw_bits(1) == 0
    Counter({0: 45, 1: 32, 2: 8, 3: 7, 4: 5, 5: 2, 7: 1})
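(A quick back-of-the-envelope check on those Counters, assuming each element is gated by an independent continue draw that succeeds with probability q, so the length is geometric:)

    E[len] = q / (1 - q)
    q ≈ 0.83  ->  0.83 / 0.17 ≈ 4.9    # up-to-8-bit draws (default p_continue)
    q = 0.50  ->  0.50 / 0.50 = 1.0    # single-bit boolean draw

(The sample means of the two Counters above are about 6.4 and 1.1 respectively, so the single-bit numbers match closely and the default is at least in the right ballpark.)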
I'm just not sure why FDP is involved - we could use …

I've added a dependency on #4241 for a clean merge.
Two intuitions: …

I think this points pretty strongly towards using the FDP to implement our provider, and then also implementing …
I'm just not sure how this could be done! The kwargs of our typed choice sequence interface are richer than what the FDP supports, and once you tack on implementing all the kwargs with atomic FDP …

I agree with both of your points - but I would still advocate for 8- instead of 1-bit booleans, on the grounds that the byte size is still constant and that it helps the fuzzer produce and maintain good collection sizes (sure, by a probably-constant factor, but I'll take it). I'm happy to ignore integer weights.
Yeah, fair enough. I was thinking that we'd make very liberal use of early-exit, but practically speaking we'll always need some internal retries too...
Just fyi that as the last consumer of …

If you still have reservations about the bytestring provider then lmk. I'd be open to doing more stupidly-simple things now (cast 8 bytes directly to a float via the f64 bit representation, no float_to_lex or range scaling), and benchmarking improvements later, if you're skeptical. I don't want to spend a lot of time benchmarking different approaches in a full fuzzing integration now; we can leave that for later. The most important thing rn is getting …
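(A minimal sketch of the "cast 8 bytes directly to a float" option, just to pin down what's meant; the function name and the zero-padding of short buffers are assumptions for the example, not code from this PR:)

    import struct

    def float_from_raw_bytes(buf: bytes) -> float:
        # Reinterpret 8 fuzzer-supplied bytes as a big-endian IEEE-754 double:
        # no float_to_lex, no range scaling. Short buffers are zero-padded here.
        return struct.unpack("!d", buf[:8].ljust(8, b"\x00"))[0]

(Random bytes will occasionally decode to nan or inf, which is arguably fine, or even desirable, for fuzzing.)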
Let's ship this and we can revisit later if/when people complain 🙂
    bits = min(bits, 8)

    size = 2**bits
    falsey = math.floor(size * (1 - p))
    - falsey = math.floor(size * (1 - p))
    + falsey = max(1, math.floor(size * (1 - p)))
don't want to round low probabilities to zero!
Also realized while checking this that the bits calculation is very biased now that we don't recurse. I've replaced it with a constant 8 bits, which I believe is equivalent for p in {0.5, 0.75, ...} (only the top n MSBs matter <-> only draw_bit(n) matters) and is closer to the passed p otherwise.
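(To make the discussion above concrete, a minimal sketch of a constant-8-bit boolean draw using the falsey calculation from the suggestion; draw_bits here is a stand-in for however the provider actually pulls bits from the buffer, and the p <= 0 / p >= 1 guards are my assumption rather than code from this PR:)

    import math

    def draw_boolean_sketch(draw_bits, p: float = 0.5) -> bool:
        """Illustrative only: map a constant 8-bit draw onto a boolean with bias ~p."""
        if p <= 0:
            return False
        if p >= 1:
            return True
        bits = 8                  # constant draw size, as discussed above
        size = 2**bits            # 256 possible values
        # Number of values mapped to False; max(1, ...) so a small 1 - p never rounds to 0.
        falsey = max(1, math.floor(size * (1 - p)))
        return draw_bits(bits) >= falsey

(For a quick REPL check, draw_bits could be lambda n: random.getrandbits(n); with p = 0.5 or 0.75 the result is exact, and otherwise within 1/256 of the requested p.)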
This provider is "the simplest thing that works". There is no upweighting for ascii characters or small integers. Step one is getting fuzz_one_input off the buffer (this pull), and so I'm not yet putting any effort into optimizing the provider for fuzzing. In fact, arguably the simplest thing will end up performing better than most more intelligent things, as fuzzers tend to do well at discovering behavior iff choice size remains constant.