Investigate tests timing out on swarming
Issue description

Many linux_chromium_chromeos_rel_ng builds fail with an infra failure because browser_tests shards running on swarming time out.

Build: https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/386138
Test shard: https://chromium-swarm.appspot.com/task?id=34f6b02e31ff0b10&refresh=10&show_raw=1

The tests were running for an hour. Last lines in the task log:

Still waiting for the following processes to finish:
./browser_tests --brave-new-test-launcher --gtest_also_run_disabled_tests --gtest_filter=WebViewTests/WebViewNewWindowTest.Shim_TestNewWindowNoPreventDefault/1 --single_process --test-launcher-bot-mode --test-launcher-summary-output=/b/s/w/io5GldUr/output.json --user-data-dir=/b/s/w/ithCobeq/.org.chromium.Chromium.myj012/dg3wbKC
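The "Still waiting for the following processes to finish" line indicates the launching harness is blocked waiting on a child test process. For reference only, here is a minimal Python sketch of the general pattern of enforcing a per-process timeout on such a child; it is not the actual Chromium test launcher or swarming wrapper code, and the timeout value is a made-up placeholder:

import subprocess

PER_PROCESS_TIMEOUT_SEC = 45 * 60  # hypothetical budget, not a real Chromium value

def run_test_process(cmd):
    """Runs one test process and kills it if it exceeds the timeout."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=PER_PROCESS_TIMEOUT_SEC)
    except subprocess.TimeoutExpired:
        # Without a kill here, the launcher blocks indefinitely, which is what
        # the "Still waiting for the following processes to finish" log
        # suggests was effectively happening for an hour.
        proc.kill()
        proc.wait()
        return -1

# Example invocation (trimmed version of the command from the log above):
# run_test_process([
#     "./browser_tests",
#     "--gtest_filter=WebViewTests/WebViewNewWindowTest.Shim_TestNewWindowNoPreventDefault/1",
#     "--test-launcher-bot-mode",
# ])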
Mar 17 2017
Also on Windows; e.g. my job here has been pending for > 30 minutes: https://chromium-swarm.appspot.com/task?id=34f6f49e4f5eeb10&refresh=10
Mar 17 2017
I think browser_tests take more time => task duration increased => swarming capacity is not enough => all swarming tasks using the same pool of machines are affected.
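To make that chain concrete, a back-of-the-envelope model: pool throughput is roughly bots / average task duration, and once the submission rate exceeds it, the pending queue grows without bound. All numbers in this sketch are invented for illustration, not real Swarming pool figures:

# Toy capacity model; all constants below are made up.
BOTS = 1000                  # assumed pool size
ARRIVALS_PER_MIN = 90        # assumed tasks submitted per minute
NORMAL_TASK_MIN = 10         # assumed average task duration before the slowdown
SLOW_TASK_MIN = 15           # assumed average task duration after the slowdown

def queue_growth_per_hour(bots, arrivals_per_min, task_minutes):
    """Tasks added to the pending queue per hour (0 if capacity keeps up)."""
    throughput_per_min = bots / task_minutes  # tasks finished per minute
    return max(0.0, (arrivals_per_min - throughput_per_min) * 60)

print(queue_growth_per_hour(BOTS, ARRIVALS_PER_MIN, NORMAL_TASK_MIN))  # 0.0
print(queue_growth_per_hour(BOTS, ARRIVALS_PER_MIN, SLOW_TASK_MIN))    # 1400.0

Under these made-up numbers, a 50% increase in average task duration flips the pool from keeping up to falling behind by ~1400 tasks per hour, which is the kind of runaway queue described in the following comments.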
Mar 17 2017
Do we have any evidence that browser_tests suddenly started taking more time?
Mar 17 2017
The rate of HTTP 500s on swarming is elevated: http://shortn/_iOGvBQjBWv (we should have an alert on this)
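A minimal sketch of the kind of alert suggested here, assuming per-window request counts by status code are already available from monitoring; the threshold and function name are invented for illustration:

# Hypothetical 500-rate alert check; metric source, threshold and name are assumptions.
def http_500_rate_too_high(count_500, count_total, threshold=0.02):
    """True if the share of HTTP 500 responses in the sampling window exceeds the threshold."""
    if count_total == 0:
        return False
    return count_500 / count_total > threshold

# e.g. 120 of 3000 requests in the last window returned 500 -> 4% -> alert fires
assert http_500_rate_too_high(120, 3000)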
Mar 17 2017
... => # of pending tasks is too high => the task query takes too much time => the 1 min HTTP request timeout is hit => HTTP 500
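The last two arrows of that chain can be illustrated with a toy model: once the estimated query latency (a made-up function of pending-queue size) exceeds the 60-second request deadline, the request fails and surfaces as an HTTP 500. This is a stand-in, not Swarming's actual server code:

REQUEST_DEADLINE_SEC = 60  # the 1-minute HTTP request timeout mentioned above

def estimated_query_latency_sec(pending_tasks):
    """Stand-in cost model: query latency grows with the size of the pending queue."""
    return 5 + pending_tasks * 0.001  # made-up constants

def handle_task_query(pending_tasks):
    """Maps the toy latency model onto an HTTP-style status code."""
    if estimated_query_latency_sec(pending_tasks) > REQUEST_DEADLINE_SEC:
        return 500  # deadline exceeded -> surfaces to callers as an HTTP 500
    return 200

print(handle_task_query(10000))   # 200: a ~15s query fits inside the deadline
print(handle_task_query(100000))  # 500: a ~105s query blows the 60s deadline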
Mar 17 2017
Run time of each swarming test on linux_chromium_rel_ng: http://shortn/_HuFQkyFGUu
Mar 17 2017
something happened at about 14:20 PDT
Mar 17 2017
http://shortn/_oupxZGZ4T9 shows only browser_tests
Mar 17 2017
there were no swarming deployments today http://shortn/_QxzpD1fNiV
Mar 17 2017
I don't see a problem with infrastructure, so, as a trooper, I am not sure what to do here besides escalation. jam, dpranke?
Mar 17 2017
It seems the outage has been resolved: http://shortn/_anRPIjnlww
Test times are back down and current tryjobs don't seem to be purple.
Mar 17 2017
I believe rcabebfee33715afb9e577f6d133898b408390ab1 might have been the breaking change. It's since been reverted. (Though the timing doesn't really line up, so it could have been something else.)
Mar 18 2017
This was reverted 4 hours ago but Windows and Mac try runs are still timing out on swarming jobs.
Mar 18 2017
The test runtimes of both mac_chromium_rel_ng and win_chromium_rel_ng returned to normal at about the same time as linux_chromium_rel_ng:
mac_chromium_rel_ng: http://shortn/_qFyySetpb1
win_chromium_rel_ng: http://shortn/_rfaDsNPXwp
Any currently expiring jobs could just be blowback from that short period when nearly everything was expiring. Is there a particular trybot that's still affected?
Mar 20 2017
Looks like this outage was resolved source-side.
Mar 20 2017
I'm going to reopen this a bit to do some follow-up, but we can take this out of the trooper queue. It sounds like this actually went reasonably well in terms of handling a src-induced overload, and so I think it might be interesting to look at it for places to improve.

nodir@ - how did you notice this? Did we get step failure alerts, or did someone ping you, or something else?

Regarding the comments in #5-#6, maruel@, are the queries taking too long expected behavior? Should we be able to shed load better so that queries complete in a timely manner? Did we get alerts on pending times for tasks being too long? If not, do we need to adjust thresholds there?

Do we know what the problem with the offending CL was? Was it hanging the task?

Do we know how quickly we ran out of capacity, or what the performance characteristics are like, to know how quickly we should've recovered and how well we did?
Mar 20 2017
Oh, also, were we planning to write a postmortem for this?
Mar 20 2017
I received pages for a high infra failure rate in some builders; see tickets in o/369081. Then I received an alert for a high rate of HTTP 500s on Swarming (which was caused by the long pending queue, itself caused by the same thing).

There were no alerts from the metrics mentioned in #7. I am sure such alerts would be useful, because those metrics are precomputations with a very large window (2h). I think stip@ set these metrics up, but bpastene should know better.

I didn't dig into the Chromium side of this outage, e.g. what caused the timeout, whether it was a slowdown or a hang, etc. I don't think I have enough domain-specific knowledge.

> Do we know how quickly we ran out of capacity, or what the performance characteristics are like to know how quickly we should've recovered and how well we did?

Here is a graph of pending tasks: http://shortn/_YUk933miHf
I think "active jobs" is close to "number of bots".
At ~1pm the pending queue started to grow.
At ~2pm it was so long that tasks started to expire.
I interpret the graph as showing that at ~6pm the service got back to normal, but maruel@ or sergeyberezin@ could tell better.

I think it would be useful to set up a few more swarming alerts:
- # of pending tasks, i.e. http://shortn/_feQCpt0Mk8
- rate of task expiration
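A minimal sketch of the two proposed alerts, assuming the pending-task count and recent expiration count are already exported to monitoring; the threshold values and alert names are invented for illustration:

# Hypothetical thresholds and alert names; real values would come from the
# dashboards linked above, not from this sketch.
PENDING_TASKS_THRESHOLD = 5000
EXPIRATIONS_PER_HOUR_THRESHOLD = 50

def swarming_alerts(pending_tasks, expirations_last_hour):
    """Returns the names of the alerts that should fire for the given samples."""
    alerts = []
    if pending_tasks > PENDING_TASKS_THRESHOLD:
        alerts.append("swarming-pending-tasks-high")
    if expirations_last_hour > EXPIRATIONS_PER_HOUR_THRESHOLD:
        alerts.append("swarming-task-expiration-rate-high")
    return alerts

# By the timeline above (queue growing from ~1pm, expirations starting ~2pm),
# the pending-tasks alert would have fired before any tasks expired.
print(swarming_alerts(pending_tasks=12000, expirations_last_hour=0))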
Mar 20 2017
I didn't plan to write a postmortem. From my perspective the outcome of this postmortem is:
- alerts on InfraFailures worked
- we need more alerts on swarming to detect the problem even earlier; # of pending tasks would catch it early
- it would be great if Swarming had an "EMERGENCY add more bots" button (like the "emergency add quota" button in AppEngine), but I think that is possible only for Linux (thanks to MachineProvider), not for Macs/Windows.
Mar 20 2017
The long pending queue leading to sluggish performance is being worked on. That's my main project at the moment.
Apr 16 2017