Corpus pruning for some fuzzers failing, corpora growing unbounded
Issue description: E.g. v8_wasm_code_fuzzer; see error.log attached. Corpus is on gs://clusterfuzz-corpus/libfuzzer/v8_wasm_code_fuzzer. It looks like our automatic quarantining does not find this OOM unit, but the OOM only happens when running the collective merge on all units. Kostya, Mike - any suggestions here?
Oct 11 2016
We currently have a 3 GB RSS limit, and 3.7 GB overall on the bots. Increasing that might not work; we will hit swap pretty quickly.
Oct 11 2016
Another possibility is to merge using a plain non-ASan build, which consumes less RAM (assuming the corpus is free of memory corruption bugs).
Oct 11 2016
This is pretty much bug 615191 (sorry, missed the recent comments on that!), which I didn't have a good solution for. We do give the merge process a +1 GB limit over the individual unit runs, but that doesn't seem to be enough. Based on some debugging a while back with the pdf_jpx fuzzer, it didn't appear to be due to an undetected leak in any particular testcase. It seemed like the merge process can hold on to freed memory (contributing to the RSS limit) for a particular unit across a few different unit runs, but eventually the RSS goes back down to the expected level. Could this be a quirk of the allocator or of how the RSS limit is calculated?
Oct 11 2016
We get RSS from getrusage (ru_maxrss). BTW, if you are running the ASan process you are also paying the price for the quarantine, which grows over time (up to some limit); ASAN_OPTIONS=quarantine_size_mb=10 may help. Also, when it comes to anything with v8 in the name I am not sure. Some of the v8 targets I've seen have their own internal life (garbage collection?).
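For reference, a minimal Python sketch of the equivalent ru_maxrss check; libFuzzer does this internally in C++, and the 3 GB figure is just the bot limit mentioned above:

    import resource

    def peak_rss_mb():
        # ru_maxrss is the peak resident set size of this process;
        # on Linux it is reported in kilobytes.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

    # Compare against the 3 GB bot limit mentioned above.
    if peak_rss_mb() > 3 * 1024:
        print('RSS limit exceeded')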
Oct 12 2016
Some more issues:
1. ASAN_OPTIONS=detect_leaks=1 overrides -detect_leaks=0 in the corpus merge step (try with web_icon_sizes_fuzzer). Fixing CF to disable this in ASAN_OPTIONS as well.
2. We don't sync our corpus after we process bad units/quarantining, so when the merge step then fails we raise an exception and all the previous work on processing bad units/quarantining is lost. Fixing CF to add a redundant step of syncing the corpus after quarantining (a rough sketch of the ordering is below).
Other remaining bugs:
1. For c#4, c#5, we need to find a good workaround to make merging not fail.
2. web_icon_sizes_fuzzer is leaking on shutdown on every testcase (with ASAN_OPTIONS=detect_leaks=1). Is that a false positive, or should we suppress those leaks?
3. During fuzzing we always turn on leak detection; should we randomize and sometimes disable it so that we can move forward if a fuzzer keeps hitting a leak? Needs thought. This is happening with the pdfium fuzzer too, for example.
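A rough sketch of the intended ordering for points 1 and 2 above (Python; placeholder paths, gsutil used for the sync; illustrative rather than the actual ClusterFuzz code):

    import os
    import subprocess

    def prune_corpus(target, corpus_dir, gcs_url):
        # Step 1 (not shown): run each unit individually and move bad ones
        # (OOM / leak / crash) out of corpus_dir into a quarantine directory.
        # Step 2: sync the cleaned corpus now, so the quarantining work is
        # not lost if the merge step below throws an exception.
        subprocess.check_call(['gsutil', '-m', 'rsync', '-d', corpus_dir, gcs_url])
        # Step 3: run the collective merge with leak detection disabled in
        # both ASAN_OPTIONS and libFuzzer's own flag.
        env = dict(os.environ, ASAN_OPTIONS='detect_leaks=0')
        merged_dir = corpus_dir + '_merged'
        os.makedirs(merged_dir, exist_ok=True)
        subprocess.check_call(
            [target, '-merge=1', '-detect_leaks=0', merged_dir, corpus_dir],
            env=env)
        # Step 4: sync again so the smaller merged corpus replaces the old one.
        subprocess.check_call(['gsutil', '-m', 'rsync', '-d', merged_dir, gcs_url])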
Oct 13 2016
>> ASAN_OPTIONS=detect_leaks=1 overrides -detect_leaks=0
libFuzzer's -detect_leaks=0 does not affect leak detection at exit; it only disables leak detection during the run:
    detect_leaks    1    If 1, and if LeakSanitizer is enabled, try to detect memory leaks during fuzzing (i.e. not only at shut down).
Oct 14 2016
Thanks Kostya. Issues fixed so far:
1. Set ASAN_OPTIONS=detect_leaks=0 as well during the corpus merge. We rely on the merge to finish properly so that it writes the corpus and returns with a 0 exit code.
2. We sync the corpus redundantly after quarantining bad units and before the merge. This way our corpus does not grow unbounded if the merge throws an exception.
3. There was a bug in corpus pruning where a testcase with a detected leak didn't get added to the global blacklist and hence always remained as a blocker. From now on, one leak shouldn't block fuzzing for newer crashes (just like for regular fuzzers). Still needs verification.
4. I explicitly set the merge step to use a 16-byte redzone instead of the default 32; let's see if it helps with the OOM issues.
More issues still left to analyze.
Oct 14 2016
Fixes landed include:
https://chromereviews.googleplex.com/526957013/
https://chromereviews.googleplex.com/521127013/
More bugs found: if a fuzzer crashes on a leak instantly, that testcase will be a zero-byte testcase, and apparently we use testcase length to decide whether to fuzz or to reproduce, so that breaks CF. Need to fix that to create reliable leak testcases, e.g. for renderer_fuzzer.
Oct 14 2016
>> if a fuzzer crashes on leaks instantly, that testcase will be a zero byte testcase.
Hm... shouldn't happen. Can you give details or a repro?
Oct 14 2016
Verified by running renderer_fuzzer with detect_leaks=1: libFuzzer creates a 1-byte testcase, so this looks like a ClusterFuzz-side issue; looking more. Also, another problem is that some of these 0-byte testcases are in the corpus, named empty_testcase, e.g. https://storage.cloud.google.com/clusterfuzz-corpus/libfuzzer/audio_decoder_ilbc_fuzzer_static/empty_testcase?authuser=0&_ga=1.40814176.352874260.1450287074. I am asking Oliver and Max how they sneaked in.
Oct 17 2016
OK, the ClusterFuzz-side leak checking had major problems (intern code :(). I fixed it in https://chromereviews.googleplex.com/525207013/. That is why most leak testcases were not reproducible and no good bugs were filed. On another note, many fuzzers leaking on shutdown on every testcase looks related to https://codereview.chromium.org/2402503002/#msg44. Oliver enabled is_lsan=true, which sets the LEAK_SANITIZER define and should hopefully fix these false positives: https://codereview.chromium.org/2428643002/, https://codereview.chromium.org/2424893002/. Need to check stats tomorrow for results. Right now, these show "No coverage" on https://cluster-fuzz.appspot.com/fuzzerstats?fuzzer_name=all&job_type=libfuzzer_chrome_asan&last_n=2&last_n_type=days&group_by=fuzzer
Oct 18 2016
Many fuzzers are fixed with Oliver's change. I fixed another leak suppressions issue in https://chromereviews.googleplex.com/524317013. The latest bad bug is that our libFuzzer leak testcases are unreproducible, because we pass a big -runs=65535 argument and leak detection does not happen between runs. Kostya said to file a bug; it should be an easy fix. Filed crbug.com/657088. Until then, a 100-runs workaround is fine (takes about 5 seconds for e.g. pdfium); otherwise the testcase is better marked as unreproducible. A rough sketch of the workaround is below.
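Sketch of the 100-runs workaround (Python, placeholder paths; the real ClusterFuzz reproduction command may differ):

    import os
    import shutil
    import subprocess
    import tempfile

    def reproduce_leak(target, testcase):
        # Leaks are effectively only reported when the process exits, so fuzz
        # from the testcase for a small number of runs instead of -runs=65535,
        # so the process exits (and LeakSanitizer reports) quickly.
        seed_dir = tempfile.mkdtemp()
        shutil.copy(testcase, seed_dir)
        env = dict(os.environ, ASAN_OPTIONS='detect_leaks=1')
        return subprocess.call([target, '-runs=100', seed_dir], env=env)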
Oct 19 2016
Phew, the leak problems are now resolved (fixes in crbug.com/657088, thanks Kostya, plus the CF workaround); see reproducible leaks at https://cluster-fuzz.appspot.com/?search=project:chromium%20-leak#testcases. We will add them automatically to the global blacklist (and remove them when needed for reproduction). Now, the next (and hopefully last remaining) set of problems:
1. Out of memory. In this case the OOM only happens when running the corpus collectively for merge, so single-unit quarantining does not help. redzone=16 (the smallest) plus quarantine_size_mb=0 does not help either, e.g. libfuzzer_pdf_jpx_fuzzer with the corpus on gs://clusterfuzz-corpus/libfuzzer/pdf_jpx_fuzzer/. A sketch of this merge configuration is below.
2. Some fuzzers like angle_translator_fuzzer are very slow; their corpus is growing out of control, single-unit quarantining is not completing, and even just doing the merge fails on slow units. Needs more investigation.
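For problem 1, the lower-memory merge configuration that still OOMs looks roughly like this sketch (illustrative values and placeholder paths, not the exact ClusterFuzz command):

    import os
    import subprocess

    def low_memory_merge(target, merged_dir, corpus_dir, rss_limit_mb):
        # Smallest redzone and no ASan quarantine, with whatever RSS headroom
        # the bot allows; pdf_jpx_fuzzer's collective merge still OOMs here.
        env = dict(os.environ,
                   ASAN_OPTIONS='detect_leaks=0:redzone=16:quarantine_size_mb=0')
        return subprocess.call(
            [target, '-merge=1', '-rss_limit_mb=%d' % rss_limit_mb,
             merged_dir, corpus_dir],
            env=env)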
Oct 21 2016
When a fuzzer fails the corpus pruning task (bad unit quarantining, merging), the next time it runs it shrinks the corpus down to 10000 units to prevent the corpus from growing unbounded (a rough sketch of this fallback is below). We will see if this fixes any of the fuzzers failing below. OK, the remaining failures are down to 4; logs for three of them are enclosed. angle_translator_fuzzer is just timing out in the bad-unit processing stage; we need to add a timeout there.
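The shrink fallback, roughly (a sketch; the real unit-selection policy may differ from the random choice used here):

    import os
    import random

    MAX_UNITS_AFTER_FAILURE = 10000

    def shrink_corpus(corpus_dir, max_units=MAX_UNITS_AFTER_FAILURE):
        # If the previous pruning task (quarantining + merge) failed, cap the
        # corpus at max_units so it cannot keep growing unbounded.
        units = os.listdir(corpus_dir)
        if len(units) <= max_units:
            return
        for name in random.sample(units, len(units) - max_units):
            os.remove(os.path.join(corpus_dir, name))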
Oct 23 2016
OK, after the c#16 workaround, the corpora of more fuzzers like icu_ucasemap_fuzzer, v8_wasm_code_fuzzer, and angle_translator_fuzzer shrank from their initial massive size; they now run fine every day, the corpus size is not jumping too much, and merging is working. And by working, it's finding new bugs too, e.g. https://bugs.chromium.org/p/chromium/issues/detail?id=658555. The remaining ones are OOMs tracked by their individual bugs (see c#18, c#19): v8_serialized_script_value_fuzzer, pdf_jpx_fuzzer.
Oct 26 2016
All fixed. v8_wasm_code_fuzzer sometimes OOMs; will file a bug.