Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add and use BytestringProvider in fuzz_one_input #4221

Merged
merged 4 commits into from
Jan 18, 2025

Conversation

tybug
Copy link
Member

@tybug tybug commented Dec 29, 2024

This provider is "the simplest thing that works". There is no upweighting for ascii characters or small integers. Step one is getting fuzz_one_input off the buffer (this pull), and so I'm not yet putting any effort into optimizing the provider for fuzzing. In fact, arguably the simplest thing will end up performing better than most more intelligent things, as fuzzers tend to do well at discovering behavior iff choice size remains constant.

@tybug tybug requested a review from Zac-HD as a code owner December 29, 2024 22:50
@tybug tybug force-pushed the atheris-tcs branch 4 times, most recently from ba1aa8d to 82d572b Compare December 30, 2024 03:12
Copy link
Member

@Zac-HD Zac-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, but I'd be inclined to make it simpler where we can - for example, just drawing zero or one bits for any boolean.

More generally we probably want to abort rather than retrying whenever we reasonably can for the fuzzing use-case, but maybe we leave that out of BytestringProvider and instead construct a special-purpose AtherisProvider in a future PR.

(I'd also like to generate a starter corpus for Atheris using the HypothesisBackend, but atheris.FuzzedDataProvider is in C++ and a moderate effort to invert, ugh)

@tybug
Copy link
Member Author

tybug commented Jan 6, 2025

Overall looks good, but I'd be inclined to make it simpler where we can - for example, just drawing zero or one bits for any boolean.

I'm hesitant to ignore boolean weights because it makes creating long collections (and preserving them in mutations by not truncating via an early boolean -> false) unlikely. Though now that you bring it up respecting integer weights is probably net-negative due to changing draw size, and I agree that aborting rather than rejection sampling OOB ints is a good idea, though am 50/50 on if that should live here or in an atheris provider.

warmstarting atheris

If the suggested workflow running hypothesis to generate a corpus and then separately running pure atheris with that corpus, then idt inverting FDP is possible, because the user may write any of many possible atheris strategies by combining Consume* methods. I think this only makes sense if we use the hypothesis strategy for both, in which case FDP is out of the picture. We could use HypothesisBackend to warmstart BytestringProvider, making inversion trivial, but requires handing more control over the atheris setup to hypothesis rather than the user redirecting atheris to fuzz_one_input. (probably need to start a warmstart subprocess before killing & continuing with the main one, or pass a custom mutator to atheris.Setup which uses HypothesisBackend for first-n).

@Zac-HD
Copy link
Member

Zac-HD commented Jan 6, 2025

Yeah, I could deal with drawing up to one byte for use as a boolean, but would probably want to resolve this via some kind of mini-benchmark.

Conceptually we're not inverting the FDP, only our provider - it'd look like bytes -> atheris.FuzzedDataProvider -> hypothesis.extra.atheris.AtherisProvider -> Sequence[ChoiceT] -> <the usual>, and then writing the Sequence[ChoiceT] -> bytes inverse function is actually practical (if we have all the choice kwargs, anyway).

@tybug tybug requested a review from DRMacIver as a code owner January 12, 2025 21:36
@tybug
Copy link
Member Author

tybug commented Jan 12, 2025

The results for lists are more or less what you'd expect - by default our p_continue is around 0.83, so the distribution falls out as standard geometric with either p=0.83 (drawing up to 8 bits) or p=0.5 (drawing only one bit), with the latter generating half as empty lists. I expect this to generalize to mutation, since any mutation on a p=0.5 boolean has a 50% chance to truncate the list at that position.

c = Counter()
@given(st.lists(st.integers()))
def f(l):
    c[len(l)] += 1

for _ in range(100):
    b = random.randbytes(1000)
    f.hypothesis.fuzz_one_input(b)
print(c)
# current
Counter({2: 15, 0: 11, 3: 9, 1: 9, 5: 8, 4: 7, 8: 6, 13: 6, 7: 5, 11: 4, 6: 3, 12: 2, 20: 2, 14: 2, 18: 2, 17: 2, 15: 2, 21: 1, 25: 1, 10: 1, 22: 1, 9: 1})
# return self._draw_bits(1) == 0
Counter({0: 45, 1: 32, 2: 8, 3: 7, 4: 5, 5: 2, 7: 1})

Conceptually we're not inverting the FDP, only our provider - it'd look like bytes -> atheris.FuzzedDataProvider ->

I'm just not sure why FDP is involved - Atheris.setup(test_f.hypothesis.fuzz_one_input) doesn't ever actually touch an FDP, because the FDP is only for people writing their own structured generators for a property, which they're not doing because they're using the hypothesis strategy instead.

We could use HypothesisProvider to generate n initial inputs, then invert those to a buffer for BytestringProvider. But again this requires some way to write the buffer back to the atheris corpus. The entire loop is controlled by atheris; atheris generates a buffer (or takes one from the corpus), mutates it, and hands it to BytestringProvider via fuzz_one_input. The only hook we have outside of this is writing to the corpus dir before atheris runs (and possibly during if libfuzzer reloads). That doesn't require inverting FDP though; atheris will take a bytestring, mutate it, and hand it directly to us. I think we could provide a fuzz_with_atheris hook that does this warmstarting in a temp dir, or even saves the warmstarting to our db.

I've added a dependency on #4241 for a clean merge.

@Zac-HD
Copy link
Member

Zac-HD commented Jan 13, 2025

Two intuitions:

  • The raw generation probabilities don't matter much, because this is going to be driven by a mutational fuzzer.
  • We want a simple and consistent bytes -> values mapping, ideally one the fuzzer is familiar with, because that makes the mutation operators more efficient / less redundant / etc.

I think this points pretty strongly towards using the FDP to implement our provider, and then also implementing fuzz_with_atheris eventually with all the extra stuff. Our inverter could also grab HypoFuzz's covering corpus for further use with Atheris, too...

@tybug
Copy link
Member Author

tybug commented Jan 13, 2025

using the FDP to implement our provider

I'm just not sure how this could be done! The kwargs of our typed choice sequence interface is richer than FDP supports, and once you tack on implementing all the kwargs with atomic FDP Consume* functions, I'd argue you end basically where this pull is anyway. ConsumeString has no parameter for min or max size, eg.

I agree with both of your points - but I would still advocate for 8- instead of 1-bit booleans, on the grounds that the byte size is still constant and that it helps the fuzzer produce and maintain good collection sizes (sure, by a probably-constant factor, but I'll take it). I'm happy to ignore integer weights completely. Outside of that, I'm not sure anything else can or should be made closer to FDP? float_to_lex in draw_float certainly isn't well understood by libfuzzer (et al), but I don't think atheris' approach is very amenable to fuzzing either: they multiply (-sys.float_info.max, sys.float_info.max) by a 64-bit float between 0 and 1.

@Zac-HD
Copy link
Member

Zac-HD commented Jan 13, 2025

Yeah, fair enough. I was thinking that'd we'd make very liberal use of early-exit, but practically speaking we'll always need some internal retries too...

@tybug
Copy link
Member Author

tybug commented Jan 16, 2025

Just fyi that as the last consumer of for_buffer and cd.bytes, this is the next blocker for the typed choice sequence. I definitely don't say that to hurry your review! Just making sure you know that this PR is on the main path and not the sidelines.

If you still have reservations about the bytestring provider then lmk. I'd be open to doing more stupidly-simple things now (cast 8 bytes directly to a float via f64 bit representation, no float_to_lex or range scaling) now, and benchmarking improvements later, if you're skeptical. I don't want to spend a lot of time benchmarking different approaches in a full fuzzing integration now, but we can leave that for later. The most important thing rn is getting fuzz_one_input off the buffer.

Copy link
Member

@Zac-HD Zac-HD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's ship this and we can revisit later if/when people complain 🙂

bits = min(bits, 8)

size = 2**bits
falsey = math.floor(size * (1 - p))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
falsey = math.floor(size * (1 - p))
falsey = max(1, math.floor(size * (1 - p)))

don't want to round low probabilities to zero!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also realized checking this that the bits calculation is very biased now that we don't recurse. I've replaced it with a constant 8 bits, which I believe is equivalent for p = {0.5, 0.75, ...} (only (top n) MSB matters <-> only draw_bit(n) matters) and is closer to the passed p otherwise.

@tybug tybug merged commit 2ce4344 into HypothesisWorks:master Jan 18, 2025
47 checks passed
@tybug tybug deleted the atheris-tcs branch January 18, 2025 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants