Shut down if all initial inputs crash #276

Open
wants to merge 2 commits into master

Conversation

ThorbjoernSchulz

Regarding #262

The problem:
If a Fuzz function crashes on every initial input provided, the fuzzer
still executes and posts stats to stdout. However, since no inputs
are provided to the workers, nothing meaningful happens.
The behavior is confusing for two reasons:

  1. No proper logging is done. The coordinator continues to execute and
    posts stats.
  2. Crashing inputs in general are allowed. The problem arises only if
    all initial inputs crash.

The problem in more detail:
The coordinator maps the initial corpus into memory. If no corpus is
provided, the coordinator creates a default input (the empty input).
Workers then access the corpus through the hub, not through the coordinator.
Before this happens, they are given the initial corpus files and test
them. If the Fuzz function crashes on these inputs, they are never
passed to the hub and are therefore not available to the workers after
this initial stage.
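
For illustration, here is a toy model of that flow (not go-fuzz code; every
name below is invented for this example). It shows how the hub ends up empty
whenever triage drops every seed:

package main

import "fmt"

// crashes stands in for "the Fuzz function panics on this input".
// In this toy model every seed crashes, which is the scenario from the issue.
func crashes(input []byte) bool { return true }

func main() {
	initial := [][]byte{[]byte("a"), []byte("b")} // seeds the coordinator hands to the workers
	var hub [][]byte                              // corpus the workers later draw from

	// Triage: only inputs that do not crash are forwarded to the hub.
	for _, in := range initial {
		if crashes(in) {
			continue // crashing seeds never reach the hub
		}
		hub = append(hub, in)
	}

	// Every seed crashed, so the hub stays empty and the workers have nothing to fuzz.
	fmt.Println("hub corpus size:", len(hub)) // prints 0
}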

The proposed solution:
After the initial triage stage, we wait a short amount of time for
synchronization purposes. Then we check whether the corpus is still empty.
If it is, we conclude that the target crashed on every input and
shut down the fuzzer with a corresponding error message.
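
For illustration, a minimal sketch of the proposed check (the function name,
the corpusLen callback, and the error message are made up for this example,
not actual go-fuzz identifiers):

package main

import (
	"log"
	"time"
)

// checkInitialCorpus sketches the proposed shutdown check.
// corpusLen stands in for "how many inputs the hub currently holds".
func checkInitialCorpus(corpusLen func() int) {
	// Give the hub a moment to record any inputs that survived triage.
	time.Sleep(100 * time.Millisecond)
	if corpusLen() == 0 {
		log.Fatal("all initial inputs crash the Fuzz function; shutting down")
	}
}

func main() {
	// Example: an empty corpus triggers the shutdown path.
	checkInitialCorpus(func() int { return 0 })
}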

Please comment if the solution is lacking in any way.

@klauspost

I tried out your patch, and things still stop processing, but quite randomly.

However, the more workers I run, the more likely a lockup seems to be. Weird.

Restarting does sometimes allow it to progress. No critique, but I am not sure this fixes all cases of #262

3 executions:

 go-fuzz -v=0 -minimize=5s -bin=fuzz-build.zip -workdir=corpus -procs=32
2019/12/17 16:24:17 workers: 32, corpus: 5203 (2s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 0, uptime: 3s
2019/12/17 16:24:20 workers: 32, corpus: 5203 (5s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 2009, uptime: 6s
2019/12/17 16:24:23 workers: 32, corpus: 5203 (8s ago), crashers: 0, restarts: 1/6117, execs: 110121 (11814/sec), cover: 2009, uptime: 9s
2019/12/17 16:24:26 workers: 32, corpus: 5203 (11s ago), crashers: 0, restarts: 1/6117, execs: 110121 (8937/sec), cover: 2009, uptime: 12s
2019/12/17 16:24:29 workers: 32, corpus: 5203 (14s ago), crashers: 0, restarts: 1/6117, execs: 110121 (7187/sec), cover: 2009, uptime: 15s

 go-fuzz -v=0 -minimize=5s -bin=fuzz-build.zip -workdir=corpus -procs=32
2019/12/17 16:24:58 workers: 32, corpus: 5203 (3s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 0, uptime: 3s
2019/12/17 16:25:01 workers: 32, corpus: 5203 (6s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 2009, uptime: 6s
2019/12/17 16:25:04 workers: 32, corpus: 5203 (9s ago), crashers: 0, restarts: 1/5274, execs: 105488 (11322/sec), cover: 2009, uptime: 9s
2019/12/17 16:25:07 workers: 32, corpus: 5203 (12s ago), crashers: 0, restarts: 1/5274, execs: 105488 (8565/sec), cover: 2009, uptime: 12s
2019/12/17 16:25:10 workers: 32, corpus: 5203 (15s ago), crashers: 0, restarts: 1/5274, execs: 105488 (6887/sec), cover: 2009, uptime: 15s
2019/12/17 16:25:13 workers: 32, corpus: 5203 (18s ago), crashers: 0, restarts: 1/5274, execs: 105488 (5759/sec), cover: 2009, uptime: 18s

 go-fuzz -v=0 -minimize=5s -bin=fuzz-build.zip -workdir=corpus -procs=32
2019/12/17 16:25:30 workers: 32, corpus: 5209 (2s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 0, uptime: 3s
2019/12/17 16:25:33 workers: 32, corpus: 5209 (5s ago), crashers: 0, restarts: 1/0, execs: 0 (0/sec), cover: 2009, uptime: 6s
2019/12/17 16:25:36 workers: 32, corpus: 5209 (8s ago), crashers: 0, restarts: 1/3750, execs: 112522 (12078/sec), cover: 2009, uptime: 9s
2019/12/17 16:25:39 workers: 32, corpus: 5209 (11s ago), crashers: 1, restarts: 1/4893, execs: 342542 (27811/sec), cover: 2009, uptime: 12s
2019/12/17 16:25:42 workers: 32, corpus: 5209 (14s ago), crashers: 1, restarts: 1/5554, execs: 622050 (40613/sec), cover: 2009, uptime: 15s
2019/12/17 16:25:45 workers: 32, corpus: 5209 (17s ago), crashers: 1, restarts: 1/6328, execs: 873388 (47684/sec), cover: 2009, uptime: 18s
2019/12/17 16:25:48 workers: 32, corpus: 5209 (20s ago), crashers: 1, restarts: 1/6785, execs: 1092409 (51248/sec), cover: 2009, uptime: 21s
[... keeps running]

In some of the cases where execution stops, one worker is running at full speed, but nothing appears to happen.

@ThorbjoernSchulz
Author

Thank you for the feedback. Can you share the fuzz function you used? I will try to reproduce this tomorrow and look further into it.

@klauspost

klauspost commented Dec 17, 2019

@ThorbjoernSchulz

go get -u github.com/klauspost/simdjson-fuzz
go get -u github.com/fwessels/simdjson-go
cd $GOPATH/src/github.com/klauspost/simdjson-fuzz
go-fuzz-build -o=fuzz-build.zip -func=Fuzz .
go-fuzz -minimize=5s -bin=fuzz-build.zip -workdir=corpus -procs=16

Using verbose output it ends up with something like this:

2019/12/17 18:36:47 workers: 16, corpus: 5286 (2s ago), crashers: 24, restarts: 1/0, execs: 0 (0/sec), cover: 0, uptime: 3s
2019/12/17 18:36:47 hub: corpus=358 bootstrap=358 fuzz=0 minimize=0 versifier=0 smash=0 sonar=0
2019/12/17 18:36:47 worker 0: triageq=33 execs=17269 mininp=17227 mincrash=0 triage=42 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 1: triageq=6 execs=6463 mininp=6452 mincrash=0 triage=12 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 3: triageq=1 execs=2467 mininp=2465 mincrash=0 triage=3 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 4: triageq=0 execs=14400 mininp=14391 mincrash=0 triage=9 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 2: triageq=9 execs=4801 mininp=4670 mincrash=0 triage=132 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 5: triageq=5 execs=15234 mininp=15216 mincrash=0 triage=18 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 9: triageq=3 execs=6651 mininp=6631 mincrash=0 triage=21 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:47 worker 8: triageq=0 execs=4350 mininp=4180 mincrash=0 triage=171 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:48 worker 6 triages coordinator input [24][56 4 134 248 50 207 33 67 208 107 249 95 120 27 102 230 89 83 77 157] minimized=true smashed=false
2019/12/17 18:36:48 worker 6: triageq=0 execs=3825 mininp=1320 mincrash=0 triage=2506 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:48 worker 10 triages coordinator input [7][131 169 22 178 212 36 154 81 152 182 248 40 29 202 223 43 254 252 190 50] minimized=true smashed=false
2019/12/17 18:36:48 worker 10: triageq=5 execs=3850 mininp=1420 mincrash=0 triage=2431 fuzz=0 versifier=0 smash=0 sonar=0 hint=0
2019/12/17 18:36:48 testee:
2019/12/17 18:36:48 testee:
2019/12/17 18:36:49 testee:
2019/12/17 18:36:49 testee:
2019/12/17 18:36:49 testee:
2019/12/17 18:36:49 testee:
2019/12/17 18:36:49 testee:
2019/12/17 18:36:50 workers: 16, corpus: 5286 (5s ago), crashers: 24, restarts: 1/0, execs: 0 (0/sec), cover: 2005, uptime: 6s
2019/12/17 18:36:50 hub: corpus=358 bootstrap=358 fuzz=0 minimize=0 versifier=0 smash=0 sonar=0
2019/12/17 18:36:53 workers: 16, corpus: 5286 (8s ago), crashers: 24, restarts: 1/6100, execs: 79310 (8500/sec), cover: 2005, uptime: 9s
2019/12/17 18:36:53 hub: corpus=358 bootstrap=358 fuzz=0 minimize=0 versifier=0 smash=0 sonar=0
2019/12/17 18:36:56 workers: 16, corpus: 5286 (11s ago), crashers: 24, restarts: 1/6100, execs: 79310 (6432/sec), cover: 2005, uptime: 12s
...

Adding more workers (I have 32 threads) seems to increase the chance of a deadlock.

And yes, the code does panic fairly often... The tested package is commit 56fc5c4ff6bb831811b4640aecdb998edebe545a (currently master)

go-fuzz/worker.go (outdated)
The break statement accidentally slipped into the wrong block.
@ThorbjoernSchulz
Author

Looking into the CI results, the errors do not stem from these changes, do they?

// If the corpus is empty at this point, we have nothing to feed to our workers.
if x == 1 {
	// Give the hub time to store the initial inputs
	time.Sleep(100 * time.Millisecond)
Owner

If I am reading this correctly, we can fail spuriously if the hub is delayed a bit (overloaded machine, virtualization, etc.).
Also, we don't generally ask the user to add anything to the corpus; it's fine to start with an empty one. Also, if just part of the inputs crash, we will continue working; this radical difference in behavior between a partial crash and a total crash feels wrong. Also, the corpus may contain, say, just one input, and it happened to crash after a code update. Then the fuzzer won't start.
I agree the current behavior is bad. But I think a more useful way to fix this would be to let workers work even with an empty corpus. They could generate completely random inputs; it does not seem we can do anything better in the absence of any seeds.
Then if just part of the inputs crash (and we were unlucky with the corpus; I think I saw cases where the initial empty input crashes, maybe due to a bug in the fuzz function itself), the fuzzer will gracefully recover. Otherwise, the crash count will go up indefinitely, which is, well, the expected behavior in such a case.

It may be a bit tricky to preserve this part then:

		if len(ro.corpus) == 0 {
			// Some other worker triages corpus inputs.
			time.Sleep(100 * time.Millisecond)
			continue
		}

because we want to wait for the hub to provide the initial inputs, but not if there are really 0 inputs. I think this bit has to do with the corpus inflation problem mentioned above:

			// Other workers are still triaging initial inputs.
			// Wait until they finish, otherwise we can generate
			// as if new interesting inputs that are not actually new
			// and thus unnecessary inflate corpus on every run.
			time.Sleep(100 * time.Millisecond)

But I hope there is some way to preserve it.
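
A rough sketch of the "generate completely random inputs when the corpus is
empty" fallback suggested above (the helper name and size limit are
illustrative, not go-fuzz internals):

package main

import (
	"fmt"
	"math/rand"
)

// randomSeed fabricates a random input a worker could fuzz when the hub
// holds no corpus entries at all. Purely illustrative.
func randomSeed(rnd *rand.Rand, maxLen int) []byte {
	data := make([]byte, rnd.Intn(maxLen+1))
	rnd.Read(data)
	return data
}

func main() {
	rnd := rand.New(rand.NewSource(1))
	fmt.Printf("random seed: % x\n", randomSeed(rnd, 16))
}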

@dvyukov
Owner

dvyukov commented Mar 18, 2020

Yes, the CI failure looks unrelated. There seems to be some change in darwin setup on travis.

@dvyukov
Owner

dvyukov commented Mar 18, 2020

This should be fixed now: be3528f
