
Issue 679825


Issue metadata

Status: Untriaged
Owner: ----
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Mac
Pri: 3
Type: Bug




chromium.perf.fyi/mac-test-retina benchmarks failing with huge JSON output from tests

Project Member Reported by charliea@chromium.org, Jan 10 2017

Issue description

Link: https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf.fyi%2FMac_Test_Retina_Perf%2F403%2F%2B%2Frecipes%2Fsteps%2Fbattor.power_cases%2F0%2Fstdout

Loading that STDOUT, the page gets *very slow*, and you can see that the data being sent to the perf dashboard is huge. Here's an excerpt on gpaste: . You can see this for yourself by scrolling to the bottom of the STDOUT and then scrolling to the right.

This doesn't seem to be happening on the chromium.perf/mac-retina bots. Maybe it has to do with the fact that the chromium.perf bot is swarmed, whereas the chromium.perf.fyi bot isn't?

It's hard to tell whether this affects only BattOr benchmarks, but I suspect it doesn't: if other benchmarks were running on this bot, I'd expect them to be failing too, since the failures don't seem BattOr-related at all. Strangely, the other two chromium.perf.fyi Mac bots (https://build.chromium.org/p/chromium.perf.fyi/builders/Mac%20Power%20Dual-GPU%20Perf and https://build.chromium.org/p/chromium.perf.fyi/builders/Mac%20Power%20Low-End%20Perf) seem to be churning along relatively happily. I wonder if something is misconfigured on this particular bot?

eyaich@, does anything about this jump out to you?
 
Cc: -charliea@google.com

Comment 2 by eyaich@chromium.org, Jan 11 2017

Cc: sullivan@chromium.org
So nothing jumps out. The JSON output from the tests is based solely on the test itself, so in theory it should be generating the same data in both cases. Have you compared the JSON output from the main waterfall bot against the FYI waterfall?
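A quick way to do that comparison is to summarize the size and chart count of each results payload before diffing them in detail. This is only a sketch: the payload contents below are hypothetical stand-ins, not the actual results files from either waterfall.

```python
import json

def summarize(raw):
    """Return (byte size, top-level entry count) for a raw JSON string."""
    parsed = json.loads(raw)
    count = len(parsed) if isinstance(parsed, (dict, list)) else 1
    return len(raw), count

# Hypothetical stand-ins for the two bots' results payloads.
main_json = json.dumps({"chart_a": {"value": 1.0}})
fyi_json = json.dumps({"chart_%d" % i: {"value": 1.0} for i in range(254)})

main_size, main_count = summarize(main_json)
fyi_size, fyi_count = summarize(fyi_json)
print("main: %d bytes, %d charts" % (main_size, main_count))
print("fyi:  %d bytes, %d charts" % (fyi_size, fyi_count))
```

If the FYI payload is dramatically larger for the same benchmark, the problem is in what the test emits, not in where it runs.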

In theory, the JSON handling and upload are done on the buildbot machine that triggers the jobs on swarming, so this shouldn't come down to swarming vs. not swarming, since the upload isn't technically done on a swarming bot.

i.e., https://luci-logdog.appspot.com/v/?s=chrome%2Fbb%2Fchromium.perf%2FMac_Retina_Perf%2F181%2F%2B%2Frecipes%2Fsteps%2Fbattor.steady_state_Dashboard_Upload%2F0%2Fstdout is the stdout for the upload on the main waterfall, done on the main bot. I guess the only difference is that the test logs are now split out, since the test runs in swarming and the upload in buildbot, so maybe it's the combination of the logs and the JSON that is causing the problem? But there does seem to be a substantially larger amount of JSON on those bots.

I wonder if it has something to do with that bug where stale data is left on the bots when an upload fails. I feel like we saw failures like this on the main waterfall; it was a known issue with the perf dashboard that we happened to be hitting right around the time we started the swarming migrations. I can't find the bug... I don't think I'm making it up, but it might be totally unrelated. Maybe Annie knows what bug I'm referring to?
Cc: eakuefner@chromium.org
So if a perf dashboard upload fails, the recipe will retry it on every subsequent build, and currently someone needs to go into the bot and clear it out manually. That could be causing consistent failures even after the JSON size is fixed.
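The retry behavior described above can snowball: one payload that can never upload stays cached on the bot and gets retried on every later build. A minimal sketch, with hypothetical names and a toy success predicate standing in for the real dashboard upload:

```python
def run_build(pending, new_payload, upload_ok):
    """One build: retry all cached payloads plus the new results.

    upload_ok(payload) stands in for the real dashboard upload call;
    anything that fails stays cached for the next build.
    """
    attempts = pending + [new_payload]
    return [p for p in attempts if not upload_ok(p)]

# Toy predicate: only small payloads upload before the timeout.
upload_ok = lambda p: len(p) < 1000

pending = []
pending = run_build(pending, "x" * 5000, upload_ok)  # oversized payload fails
pending = run_build(pending, "small", upload_ok)     # small one succeeds
print(len(pending))  # the oversized payload is still queued
```

Until someone clears the cached payload off the bot, every build's upload step pays for it again, which matches the "consistent failures even after the fix" symptom.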

+eakuefner is still looking for examples of large failed JSON uploads.

But as far as "should the JSON be this big?" I think you need to diff it against the main waterfall to investigate.
Okay, so in the process of trying to repro this, I realized something: it's _not_ the JSON from the BattOr cases that needs optimization; rather, there were 254 failed uploads being retried on every build because the upload kept timing out.

Turns out, all of the failed JSON uploads that I could see were from the V8 runtime callstats benchmark, which _does_ produce a lot of data.

Also, it's green as of the latest build: https://build.chromium.org/p/chromium.perf.fyi/builders/Mac%20Test%20Retina%20Perf/builds/411 which has 149 changes in it, so if someone wants to scrub that to figure out why, that would be interesting.

Anyway, looks like the runtime callstats benchmark is the best possible case study for my purposes.
Cc: -eakuefner@chromium.org
Components: Test>Telemetry
Components: -Tests>Telemetry
