
Issue 654868

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 2016
Cc:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocked on:
issue 657088
issue 657482
issue 658578

Blocking:
issue 539572




Corpus pruning for some fuzzers failing, corpuses growing unbounded

Project Member Reported by infe...@chromium.org, Oct 11 2016

Issue description

E.g. v8_wasm_code_fuzzer

See error.log attached. The corpus is on gs://clusterfuzz-corpus/libfuzzer/v8_wasm_code_fuzzer. It looks like our automatic quarantining does not find this OOM unit; this only happens when running the collective merge on all units.

Kostya, Mike - any suggestions here?
 
error.log
29.5 KB

Comment 1 by kcc@google.com, Oct 11 2016

Maybe just give a larger RSS limit to the merge process?
When we do merging, the overall memory consumption is larger than when running a single input.
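For illustration, this suggestion could look roughly like the sketch below. `-merge=1` and `-rss_limit_mb` are real libFuzzer flags, but the binary path and the limit values here are made-up examples, not ClusterFuzz's actual configuration.

```python
# Sketch: give the corpus merge step more RSS headroom than single-unit
# runs, per the suggestion above. -merge=1 and -rss_limit_mb are real
# libFuzzer flags; the limits and binary path are illustrative only.
FUZZ_RSS_LIMIT_MB = 2048   # assumed limit for regular fuzzing runs
MERGE_RSS_LIMIT_MB = 3072  # extra headroom for the collective merge

def merge_command(target, output_corpus, input_corpus):
    """Build a libFuzzer merge invocation with the raised RSS limit."""
    return [
        target,
        "-merge=1",
        f"-rss_limit_mb={MERGE_RSS_LIMIT_MB}",
        output_corpus,
        input_corpus,
    ]

cmd = merge_command("./v8_wasm_code_fuzzer", "corpus/", "new_units/")
```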

Comment 2 by aarya@google.com, Oct 11 2016

Cc: wrengr@chromium.org tanin@chromium.org
We currently have a 3 GB RSS limit, and 3.7 GB overall on bots. Increasing that might not work; we would hit swap pretty quickly.

Comment 3 by kcc@google.com, Oct 11 2016

Another possibility is to merge using a plain non-asan build which consumes less RAM. 
(assuming the corpus is free of memory corruption bugs)

Comment 4 by och...@chromium.org, Oct 11 2016

This is pretty much bug 615191 (sorry, missed the recent comments on that!), which I didn't have a good solution for.

We do give the merge process a +1 GB limit over the individual unit runs, but that doesn't seem to be enough. Based on some debugging a while back with the pdf jpx fuzzer, it did not appear to be caused by an undetected leak in any particular testcase.

It seemed like the merge process can hold on to freed memory (contributing to the RSS limit) for a particular unit across a few different unit runs, but eventually the RSS goes back down to the expected level. Could this be a quirk of the allocator, or of how the RSS limit is calculated?

Comment 5 by kcc@google.com, Oct 11 2016

We get RSS from getrusage (ru_maxrss).
BTW, if you are running the ASan process you are also paying the price for the quarantine, which grows over time (up to some limit).
ASAN_OPTIONS=quarantine_size_mb=10 may help. 

Also, when it comes to anything with v8 in the name, I am not sure.
Some of the v8 targets I've seen have an internal life of their own (garbage collection?).
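For reference, the ru_maxrss counter mentioned above can be read directly from Python's standard `resource` module; note its units differ by platform (kilobytes on Linux, bytes on macOS).

```python
import resource

# Peak resident set size of the current process, read via getrusage -
# the same ru_maxrss counter the RSS limit check is based on.
# Units: kilobytes on Linux, bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(peak)
```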
Some more issues:

1. ASAN_OPTIONS=detect_leaks=1 overrides -detect_leaks=0 in the corpus merge step. Try with web_icon_sizes_fuzzer. Fixing CF to disable this in ASAN_OPTIONS as well.
2. We don't sync our corpus after we process bad units/quarantining. So when we then run the merge step and it fails, we raise an exception and all previous work on processing bad units/quarantining is lost. Fixing CF to add a redundant step of syncing the corpus after quarantining.
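A rough sketch of fix 1, assuming ClusterFuzz builds the child environment itself (the option parsing here is illustrative, not ClusterFuzz's actual code):

```python
import os

# Sketch of fix 1 above: ASAN_OPTIONS takes precedence over libFuzzer's
# -detect_leaks flag for the at-exit leak check, so disable leak
# detection in the environment as well for the merge step.
def disable_leak_detection(env):
    opts = [p for p in env.get("ASAN_OPTIONS", "").split(":")
            if p and not p.startswith("detect_leaks=")]
    opts.append("detect_leaks=0")
    env = dict(env)
    env["ASAN_OPTIONS"] = ":".join(opts)
    return env

merge_env = disable_leak_detection({"ASAN_OPTIONS": "detect_leaks=1:redzone=32"})
```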

Other remaining bugs:
1. For c#4/c#5, need to find a good workaround to make merging not fail.
2. Fuzzers like web_icon_sizes_fuzzer leak on shutdown on every testcase (with ASAN_OPTIONS=detect_leaks=1). Is that a false positive, or should we suppress those leaks?
3. During fuzzing, we always turn on leak detection; should we randomize and sometimes disable it? If a fuzzer keeps hitting a leak, we could still move forward. Needs thought; this is happening with the pdfium fuzzer as well, for example.
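The randomization idea in item 3 could be as simple as the following sketch; the 0.9 probability is a made-up example, not a value from this thread.

```python
import random

# Sketch of item 3 above: enable leak detection for only a fraction of
# fuzzing sessions, so one persistent leak cannot block all progress.
# The default probability is an assumption for illustration.
def pick_detect_leaks(p_enable=0.9, rng=random.random):
    """Return the detect_leaks value (1 or 0) for this session."""
    return 1 if rng() < p_enable else 0
```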

Comment 7 by kcc@google.com, Oct 13 2016

>> ASAN_OPTIONS=detect_leaks=1 overrides -detect_leaks=0 
-detect_leaks=0 does not affect leak detection at exit.
It only disables leak detection during the run:

 detect_leaks                   1       If 1, and if LeakSanitizer is enabled try to detect memory leaks during fuzzing (i.e. not only at shut down).


Comment 8 by aarya@google.com, Oct 14 2016

Owner: infe...@chromium.org
Status: Assigned (was: Untriaged)
Thanks Kostya.

Issues fixed so far:
1. Set ASAN_OPTIONS=detect_leaks=0 as well during the corpus merge. We rely on the merge finishing properly so that it writes the corpus and returns with a 0 exit code.
2. We sync the corpus redundantly after quarantining bad units and before the merge. This way our corpus does not grow unbounded if the merge raises an exception.
3. There was a bug in corpus pruning where a testcase with a detected leak wasn't added to the global blacklist, so it always remained a blocker. From now on, one leak shouldn't block fuzzing for newer crashes (just like for regular fuzzers). Still needs verification.
4. I explicitly set the merge step to use a 16-byte redzone instead of the default 32; let's see if it helps with the OOM issues.
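Taken together, fixes 1, 2, and 4 amount to an ordering like the sketch below. All function names are hypothetical; this is not ClusterFuzz's actual code, just the shape of the pipeline described above.

```python
# Sketch of fixes 1, 2 and 4 combined: quarantine bad units, sync
# before merging so a merge failure cannot discard that work, then
# merge with leak detection off and a smaller redzone.
MERGE_ASAN_OPTIONS = "detect_leaks=0:redzone=16"

def prune_corpus(quarantine_bad_units, sync_corpus, run_merge, log):
    quarantine_bad_units()
    sync_corpus()  # redundant sync: keeps quarantining work if merge fails
    try:
        run_merge(asan_options=MERGE_ASAN_OPTIONS)
    except Exception:
        log("merge failed; corpus already synced, work not lost")
        raise
    sync_corpus()
```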

More issues still left to analyze.


Comment 9 by aarya@google.com, Oct 14 2016

Fixes landed include:
https://chromereviews.googleplex.com/526957013/
https://chromereviews.googleplex.com/521127013/

More bugs found:
If a fuzzer crashes on a leak instantly, that testcase will be a zero-byte testcase. Apparently we use testcase length to decide whether to fuzz or to reproduce, so that breaks CF; need to fix that to create reliable leak testcases. E.g. renderer_fuzzer.
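A sketch of the problem and one possible fix, with hypothetical names (the real ClusterFuzz decision logic is not shown in this thread): key the decision off whether a crash was actually recorded rather than off the file size.

```python
# Sketch of the bug described above: deciding whether to fuzz or to
# reproduce based on testcase length misfires for instant leaks, which
# leave a zero-byte testcase. Hypothetical fix: treat any testcase with
# a recorded crash as reproducible, regardless of its size.
def should_reproduce(testcase_size, has_recorded_crash):
    if has_recorded_crash:
        return True           # even a zero-byte leak testcase is a repro
    return testcase_size > 0  # old length-based heuristic as fallback
```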

Comment 10 by kcc@google.com, Oct 14 2016

>> if a fuzzer crashes on leaks instantly, that testcase will be a zero byte testcase.


Hm... Shouldn't happen. Can you give details or repro? 

Comment 11 by aarya@google.com, Oct 14 2016

Verified by running renderer_fuzzer with detect_leaks=1; libFuzzer creates a 1-byte testcase. Looks like some ClusterFuzz-side issue; looking more.

Also, another problem is that some of these crazy 0-byte testcases are in the corpus, named empty_testcase, e.g. https://storage.cloud.google.com/clusterfuzz-corpus/libfuzzer/audio_decoder_ilbc_fuzzer_static/empty_testcase?authuser=0&_ga=1.40814176.352874260.1450287074. I am asking Oliver and Max how they sneaked in.
Ok, the ClusterFuzz-side leak checking had major problems (intern code :(). I fixed it in https://chromereviews.googleplex.com/525207013/. That is why most of the leak testcases were not reproducible and no good bugs were filed.

On another note, many fuzzers leaking on shutdown on every testcase looks related to https://codereview.chromium.org/2402503002/#msg44. Oliver enabled is_lsan=true, which sets the LEAK_SANITIZER define and should hopefully fix these false positives: https://codereview.chromium.org/2428643002/, https://codereview.chromium.org/2424893002/

Need to check stats tomorrow for results. Right now, these show "No coverage" on https://cluster-fuzz.appspot.com/fuzzerstats?fuzzer_name=all&job_type=libfuzzer_chrome_asan&last_n=2&last_n_type=days&group_by=fuzzer
Blockedon: 657088
Many fuzzers are fixed with Oliver's change.
I fixed another leak suppressions issue in https://chromereviews.googleplex.com/524317013.

The latest bad bug is that our libFuzzer leak testcases are unreproducible because we had a big -runs=65535 argument and leak detection does not happen between runs. Kostya said to file a bug and it should be an easy fix. Filed crbug.com/657088.

Till then, the 100-runs workaround is fine (takes 5 sec for e.g. pdfium); otherwise the testcase is better marked unreproducible.
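The workaround amounts to fuzzing with a small `-runs` value so the process exits (and LeakSanitizer's at-exit check fires) frequently. A minimal sketch, with an illustrative binary and corpus path:

```python
# Sketch of the 100-runs workaround above: fuzz with a small -runs
# value so the process exits (triggering LSan's at-exit leak check)
# every 100 runs rather than every 65535, making leak testcases
# reproducible. Binary and corpus paths are illustrative.
LEAK_REPRO_RUNS = 100

def fuzz_command(target, corpus_dir, runs=LEAK_REPRO_RUNS):
    return [target, f"-runs={runs}", corpus_dir]

cmd = fuzz_command("./pdfium_fuzzer", "corpus/")
```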

Comment 15 by aarya@google.com, Oct 19 2016

Phew, the leak problems are now resolved (fixes in crbug.com/657088, thanks Kostya!, plus a CF workaround); see reproducible leaks at https://cluster-fuzz.appspot.com/?search=project:chromium%20-leak#testcases. We will add them automatically to the global blacklist (and remove them when needed to reproduce).

Now, the next set of problems (and hopefully the last ones remaining):
1. Out of memory. In this case, OOM only happens when running the corpus collectively for the merge, so single-unit quarantining does not work. redzone=16 (the smallest) + quarantine_size_mb=0 does not help either. E.g. libfuzzer_pdf_jpx_fuzzer with corpus on gs://clusterfuzz-corpus/libfuzzer/pdf_jpx_fuzzer/
2. Some fuzzers like angle_translator_fuzzer are very slow; their corpus is growing beyond control, single-unit quarantining is not completing, and even just doing the merge is failing on slow units. Needs more investigation.
When a fuzzer fails the corpus pruning task (bad-unit quarantining, merging), the next time it runs it shrinks the corpus down to 10000 units to prevent the corpus from growing unbounded. We will see if this fixes any of the failing fuzzers below.
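The shrink step could look roughly like this sketch (function and constant names are hypothetical, not ClusterFuzz's code): randomly keep at most 10000 units and delete the rest.

```python
import os
import random

# Sketch of the workaround above: after a failed pruning task, randomly
# keep at most MAX_UNITS files so the corpus cannot grow unbounded.
MAX_UNITS = 10000

def shrink_corpus(corpus_dir, max_units=MAX_UNITS, rng=random):
    units = sorted(os.listdir(corpus_dir))
    if len(units) <= max_units:
        return units
    keep = set(rng.sample(units, max_units))
    for name in units:
        if name not in keep:
            os.remove(os.path.join(corpus_dir, name))
    return sorted(keep)
```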

Ok, the remaining failures are down to 4. Logs of three are attached. angle_translator_fuzzer is just timing out in the bad-unit processing stage; need to add a timeout there.


icu_ucasemap_fuzzer.log
8.1 KB
v8_serialized_script_value_fuzzer.log
7.9 KB
v8_wasm_code_fuzzer.log
13.1 KB

Comment 17 by aarya@google.com, Oct 21 2016

Cc: jbroman@chromium.org
Blockedon: 657482
Blockedon: 658578
Ok, after the c#16 workaround, more fuzzers like icu_ucasemap_fuzzer, v8_wasm_code_fuzzer, and angle_translator_fuzzer have shrunk their corpus from its initial massive size; now they run fine every day, without jumping around too much, and merging is working. And by working, they are finding new bugs too, e.g. https://bugs.chromium.org/p/chromium/issues/detail?id=658555

The remaining ones are OOMs tracked by their individual bugs; see c#18, c#19:
v8_serialized_script_value_fuzzer
pdf_jpx_fuzzer


Comment 21 by aarya@google.com, Oct 26 2016

Status: Fixed (was: Assigned)
All fixed. v8 wasm sometimes OOMs; will file a bug.
