
Issue 771694

Starred by 2 users

Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 3
Type: Bug

Blocking:
issue 539572




Timeout detection based on system_clock::now() can be unfair when running on an overloaded system

Project Member Reported by mmoroz@chromium.org, Oct 4 2017

Issue description

Let's continue the discussion here. Ryan (rharrison@) raised the following concern while working on issue 770470:

"Digging into the trace a bit to understand what was going on, I noticed that libFuzzer appears to be using time based on system_clock::now() for timestamps and timeouts. This combined with the fact in this specific case the fuzzer timed out in 36 seconds instead of something closer to 25 seconds, which is what the limit was set for, leads me to suspect at least some of the issues we are seeing are due to the timeouts being calculated using wall/real clock time instead of execution/CPU time. If the fuzzer process ends up sleeping/not running for extended amounts of time due to multitasking & load on the system then the actual time spent executing will be substantially lower then the wall clock time, so the timeout will fire early.

Does this sound like a reasonable hypothesis to explain why there are so many timeouts occurring?"
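
To make the hypothesis concrete, here is a minimal sketch of a wall-clock-based timeout check (illustrative only, not libFuzzer's actual code; UnitStartTime and the function names are made up):

#include <chrono>

// Timeout check driven by wall-clock time. If the fuzzer process is
// descheduled on a loaded machine, the elapsed time keeps growing even
// though the process is not executing, so the check can trip before the
// input has really consumed TimeoutSec seconds of CPU.
static std::chrono::system_clock::time_point UnitStartTime;

void StartUnitTimer() { UnitStartTime = std::chrono::system_clock::now(); }

bool WallClockTimeoutExpired(int TimeoutSec) {
  auto ElapsedSec = std::chrono::duration_cast<std::chrono::seconds>(
                        std::chrono::system_clock::now() - UnitStartTime)
                        .count();
  return ElapsedSec > TimeoutSec;
}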


I replied that it sounded interesting and that I was going to check.

The load average seems to be in the range of 1.00-1.05, with 4 processes running on average, e.g.:

$ cat /proc/loadavg
1.01 1.02 1.05 4/237 23977

# repeated multiple times:
1.00 1.01 1.05 5/237 23987
1.00 1.01 1.05 4/237 23988
1.00 1.01 1.05 4/237 23989
1.00 1.01 1.05 4/237 23990
1.00 1.01 1.05 4/237 23991
1.00 1.01 1.05 4/237 23992
1.00 1.01 1.05 4/237 24008
1.00 1.01 1.05 4/237 24009
1.00 1.01 1.05 4/237 24010
1.00 1.01 1.05 5/237 24011
1.00 1.01 1.05 4/237 24012
1.00 1.01 1.05 4/237 24013
1.00 1.01 1.05 4/237 24014
1.00 1.01 1.05 4/237 24015
1.00 1.01 1.05 4/237 24016
1.00 1.01 1.05 4/237 24017
1.03 1.04 1.05 4/237 24084
1.03 1.04 1.05 10/237 24085
1.03 1.04 1.05 5/237 24086
1.03 1.04 1.05 4/237 24087
1.03 1.03 1.05 4/237 24088
1.03 1.03 1.05 4/237 24089
1.03 1.03 1.05 4/237 24090
1.03 1.03 1.05 4/237 24091

Q: "Do we run multiple VMs on the same physical hardware? That would potentially be exasperating this issue even if the load on the individual VMs is low." 

A: Our VMs are running on Google Compute Engine. There is definitely some shared physical hardware, but I have no idea how the hardware is shared.

And one more comment from Ryan:

"I think there is a fundamental issue here, since wall time isn't a great analogue for how much resources a process has consumed, since it advances even when a process isn't running. The timeout is effectively a fail safe to make sure that a test case doesn't consume too much resources, but it will kill unoffending processes to make sure no offending ones get through. I really wish the C++ standard had a CPU time timestamp mechanism that could just be dropped in, since I think cpu time spent is a much better analogue/load independent. To my recollection CPU time APIs tend to be OS dependent."
 
Cc: dsinclair@chromium.org
Blocking: 539572

Comment 3 by kcc@chromium.org, Oct 4 2017

CPU time is also a bad thing to measure: e.g. if we have IO-related latency,
or a deadlock, or just a sleep in the code, CPU time will be low, but the timeout will be real.

Yes, the described problem exists when we are fuzzing on an overloaded system.
But when we fuzz on a single-core dedicated VM, this should not be a big issue.
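
As an illustration of the counter-argument (a hypothetical target, not a real one): the input handler below hangs for 100 wall-clock seconds while burning almost no CPU time, so a timeout based purely on CPU time would never fire, whereas a wall-clock timeout catches it.

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  if (Size > 0 && Data[0] == 'S')
    sleep(100);  // stand-in for a deadlock or IO stall: wall time passes, CPU time barely moves
  return 0;
}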

Blocking: 770470
Blocking: -770470
