Working on revision chromium@430551. Performance Test 5 of 5 ( 59 mins, 55 secs )
stdio cache
This is an issue with the test taking an hour with no output. There's not much we can do from the bisect side. I see two solutions:
Implement some sort of heartbeat on run_benchmark to keep stdio alive and prevent buildbot from stopping the job.
Or, consult with infra to see if we could/should extend this timeout.
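For illustration, the heartbeat option could look roughly like the sketch below. It is not run_benchmark's actual code: run_with_heartbeat, its parameters, and the "[heartbeat]" line format are all hypothetical. The idea is to forward the child's output as it arrives and print a line whenever the child has been silent for a full interval, so the no-output timer never expires.

```python
import io
import subprocess
import sys
import threading
import time


def run_with_heartbeat(cmd, interval_secs=600.0, out=sys.stdout):
    """Run cmd, forward its output, and print a heartbeat line when it is quiet.

    Hypothetical helper for illustration. Returns the child's exit code.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    lock = threading.Lock()
    last_output = [time.monotonic()]  # Time of the last line written to out.

    def forward():
        # Stream the child's output line by line as it is produced.
        for line in proc.stdout:
            with lock:
                out.write(line)
                out.flush()
                last_output[0] = time.monotonic()

    reader = threading.Thread(target=forward, daemon=True)
    reader.start()
    start = time.monotonic()
    while reader.is_alive():
        # Wake up regularly to check how long the child has been silent.
        reader.join(timeout=min(interval_secs, 1.0))
        with lock:
            now = time.monotonic()
            if reader.is_alive() and now - last_output[0] >= interval_secs:
                out.write('[heartbeat] still running after %d secs\n'
                          % (now - start))
                out.flush()
                last_output[0] = now
    return proc.wait()
```

The interval would need to be well below the 3600 second limit; the 600 second default here is only a placeholder.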
The challenge is that when this happens (step timeout due to no output), the bisect code loses control of the execution, if I understand correctly, because the job goes directly to INFRA_FAILURE. And I don't think there's any indication in the buildbucket job status response that the failure was caused by this timeout, except perhaps the INFRA_FAILURE error code, which is too generic.
https://chromeperf.appspot.com/buildbucket_job_status/8996215233985005552
at the end of the steps log:
command timed out: 3600 seconds without output, attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=50586.511405
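Since the job status only reports a generic INFRA_FAILURE, one way to tell this case apart is to look for the timeout line in the steps log itself. A minimal sketch, assuming the log text is already available as a string; no_output_timeout_secs is a hypothetical helper, not something that exists in the bisect code:

```python
import re

# Matches the buildbot line quoted above, e.g.
# "command timed out: 3600 seconds without output, attempting to kill".
_NO_OUTPUT_TIMEOUT_RE = re.compile(
    r'command timed out: (\d+) seconds without output')


def no_output_timeout_secs(log_text):
    """Return the no-output timeout in seconds, or None if the log has none."""
    match = _NO_OUTPUT_TIMEOUT_RE.search(log_text)
    return int(match.group(1)) if match else None
```

A non-None result would let a failure be reported as "timed out without output" rather than a bare infra failure.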
Re #13: Yeah, I'm really confused on this one. At first glance, it looks like just a problem with the test. But then if you look at all the bisects (listed out in #11), the test very consistently takes 20-22 minutes to finish, and then all of a sudden it times out after 60 minutes with no output. So I think it could be something going wrong on the device. Any ideas how to debug?
I'm not sure, but I'm wondering about runtest.py & whether it only dumps output at the end of execution. I think that output would be pretty useful here.
Agreed the output would be extremely useful. I know I have heard there is a discrepancy between the perf and bisect recipes, where the perf one at least outputs some kind of heartbeat so the test doesn't time out after an hour. Does anyone know more about that? dtu, eyaich, martiniss, rnephew?
Every 10 minutes (or thereabouts), the test runner at build/android/test_runner.py says what each test shard is working on, to avoid timing out without output.
Does everything use chartjson output format now? Basically, the "buildbot" output format requires intercepting stdout in the recipes. This is a problem because these bisect bots hit the 60 min buildbot timeout and buildbot sends the recipes a sigkill (killing the script before anything is done with the intercepted stdout).
It would be cleanest if I could just remove support for the "buildbot" output format.
This is specifically what I am talking about....
https://cs.chromium.org/chromium/build/scripts/slave/recipe_modules/bisect_tester/perf_test.py?q=use_buildbot&sq=package:chromium&l=95&dr=C
Anyone know if use_buildbot is True for any bots?
+eakuefner telemetry never uses buildbot output format, correct? It's been completely removed?
eyaich: do the c++ perftests still use buildbot output format?
Even if non-telemetry tests still use it, maybe the bisect could just drop buildbot for telemetry tests? I think telemetry tests are the vast majority of the ones which time out, although we may have seen what appear to be timeouts on cc_perftests on android (I think that is why we set verbose output in bug 632890).
No, and yes, to your questions addressed to me.
I think it may have been long enough since we deleted it that dropping buildbot support in bisect for Telemetry tests is okay.
In crbug.com/666312 performance_browser_tests was still using buildbot format, I'm not clear if it can be configured to output anything else.
From what I can see, the bisect recipe only saves valueset or chartjson outputs at the moment, and compare_samples doesn't support buildbot output either.
Probably the long-term fix is to drop support for the format and get the tests outputting the new one. In the short term, maybe we can parse buildbot output in the bisect recipe and pass that on to compare_samples? The old bisect script had buildbot parsing code, so it wouldn't be hard to snip that out.
I have a CL that will not redirect stdout/stderr to a file if --output-format=buildbot is not specified. This will mean that, when the output format is not buildbot, we should get stdout streamed to the waterfall (and should be able to debug these timeout issues better).
https://chromium-review.googlesource.com/#/c/414297/
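The behavior described for that CL can be sketched as follows. This is only an illustration of the idea, not the code in the CL: run_test and needs_stdout_capture are hypothetical names, and the real recipe code is structured differently.

```python
import subprocess
import sys
import tempfile


def needs_stdout_capture(command):
    """Only the "buildbot" output format has to be parsed out of stdout."""
    return '--output-format=buildbot' in command


def run_test(command):
    """Run command and return (exit_code, captured_output_or_None)."""
    if needs_stdout_capture(command):
        # Legacy path: stdout/stderr go to a file, so nothing reaches the
        # waterfall until the test has finished.
        with tempfile.TemporaryFile(mode='w+') as log:
            rc = subprocess.call(
                command, stdout=log, stderr=subprocess.STDOUT)
            log.seek(0)
            return rc, log.read()
    # Other formats: the child inherits stdout/stderr, so output is streamed
    # as it is produced.
    return subprocess.call(command), None
```

With the second path, a hung test still leaves whatever it printed before hanging in the step log.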
This is currently labelled as "Infra-Failures", but from reading the comments, it looks like this is a problem with src-side stuff (e.g., tests not producing the right kind of output), and there's nothing for infra to do here.
Am I correct, and, if so, should we remove the infra-failures label for this?
I think the root cause is a failure somewhere in the android infra stack, as per #11: the test runs fine in ~20 minutes several times and then all of a sudden it times out after an hour. But then the output redirection in our recipe makes this hard to debug (see #32). Should the speed infra team take this back to change the recipe, and then, when we see this failure again, file a new bug on Infra>Client>Android?
Note that this was happening with pretty regular frequency when the bug was filed, but we haven't seen it since Dec 2.
Testing this change out on the staging_* bisect bots, and it seems to work as intended.
https://chromium-review.googlesource.com/c/423387/
I think once I land this change on non-staging bisect bots we will at least get some output from these timeouts (that is, instead of hitting the buildbot timeout and getting no output, we will probably hit some test timeout and get output on what is going on).
The buildbot timeout on the bot should now be fixed. However, it seems like the buildbot timeout may have been caused by a test timeout, so I'll leave this bug open for now (we will now get test output when it times out).
Comment 1 by robert...@chromium.org, Nov 12 2016