
Issue 702791

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Apr 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug




investigate tests on swarming timeout

Project Member Reported by no...@chromium.org, Mar 17 2017

Issue description

Many linux_chromium_chromeos_rel_ng builds fail with an infra failure because browser_tests running on swarming time out.

build: https://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/386138

test shard:
https://chromium-swarm.appspot.com/task?id=34f6b02e31ff0b10&refresh=10&show_raw=1
was running for an hour

last lines in the task log:
Still waiting for the following processes to finish:
	./browser_tests --brave-new-test-launcher --gtest_also_run_disabled_tests --gtest_filter=WebViewTests/WebViewNewWindowTest.Shim_TestNewWindowNoPreventDefault/1 --single_process --test-launcher-bot-mode --test-launcher-summary-output=/b/s/w/io5GldUr/output.json --user-data-dir=/b/s/w/ithCobeq/.org.chromium.Chromium.myj012/dg3wbKC
 

Comment 1 by no...@chromium.org, Mar 17 2017

Cc: dpranke@chromium.org
Dirk, are you aware of any changes in browser tests that might cause this?

Comment 2 by jam@chromium.org, Mar 17 2017

Also on Windows, e.g. my job here has been pending for > 30 minutes: https://chromium-swarm.appspot.com/task?id=34f6f49e4f5eeb10&refresh=10

Comment 3 by no...@chromium.org, Mar 17 2017

I think browser tests take more time => task duration increased => swarming capacity is not enough => all swarming tasks using the same pool of machines are affected.

Comment 4 by jam@chromium.org, Mar 17 2017

Do we have any evidence that browser_tests suddenly started taking more time?

Comment 5 by no...@chromium.org, Mar 17 2017

The rate of HTTP 500s on swarming is elevated: http://shortn/_iOGvBQjBWv
(We should have an alert on this.)

Comment 6 by no...@chromium.org, Mar 17 2017

... => # of pending tasks is too high => task queries take too much time => the 1-minute HTTP request timeout is hit => HTTP 500.
Run time of each swarming test on linux_chromium_rel_ng: http://shortn/_HuFQkyFGUu
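The capacity mechanism described in #3 and here can be sketched with a toy queue model (this is a hypothetical illustration, not Swarming code; all names and numbers are assumptions): once average task runtime grows enough that the fixed bot pool's completion rate falls below the task arrival rate, the pending queue grows without bound.

```python
# Hypothetical sketch, not Swarming code: why a modest increase in task
# runtime can make the pending queue grow without bound once demand
# exceeds capacity (arrival rate > bots / runtime).

def pending_queue_growth(bots, arrival_per_min, runtime_min, minutes):
    """Return the pending-queue length per minute for a fixed bot pool."""
    pending = 0.0
    history = []
    # Tasks the pool can finish per minute, assuming full utilization.
    capacity_per_min = bots / runtime_min
    for _ in range(minutes):
        pending += arrival_per_min
        pending -= min(pending, capacity_per_min)
        history.append(pending)
    return history

# Assumed numbers: 100 bots, 20 tasks/min arriving.
# At 4-min runtime, capacity is 25 tasks/min -> queue stays empty.
healthy = pending_queue_growth(bots=100, arrival_per_min=20, runtime_min=4, minutes=60)
# At 6-min runtime, capacity drops to ~16.7/min -> queue grows steadily,
# until tasks sit pending long enough to hit their expiration deadline.
overloaded = pending_queue_growth(bots=100, arrival_per_min=20, runtime_min=6, minutes=60)
```

This matches the shape of the pending-task graph in #20: the queue ramps up roughly linearly while overloaded, then drains once runtimes return to normal.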

Comment 8 by no...@chromium.org, Mar 17 2017

something happened at about 14:20 PDT

Comment 9 by no...@chromium.org, Mar 17 2017

http://shortn/_oupxZGZ4T9 is only browser tests

Comment 10 by no...@chromium.org, Mar 17 2017

there were no swarming deployments today http://shortn/_QxzpD1fNiV

Comment 11 by no...@chromium.org, Mar 17 2017

Cc: jam@chromium.org
I don't see a problem with infrastructure, so, as a trooper, I am not sure what to do here besides escalation. jam, dpranke?

Comment 12 by no...@chromium.org, Mar 17 2017

Cc: mar...@chromium.org
Labels: -Pri-0 Pri-1
It seems the outage has been resolved: http://shortn/_anRPIjnlww

Test times are back down and current tryjobs don't seem to be purple.
I believe rcabebfee33715afb9e577f6d133898b408390ab1 might have been the breaking change. It's since been reverted. (Though the timing doesn't really line up, so it could have been something else.)

Comment 15 by jam@chromium.org, Mar 18 2017

This was reverted 4 hours ago but Windows and Mac try runs are still timing out on swarming jobs.
The test runtimes of both mac_chromium_rel_ng and win_chromium_rel_ng returned to normal at about the same time as linux_chromium_rel_ng:
mac_chromium_rel_ng: http://shortn/_qFyySetpb1
win_chromium_rel_ng: http://shortn/_rfaDsNPXwp

Any current expiring jobs could just be blowback from that short period of time when nearly everything was expiring. Is there a particular trybot that's still affected?
Status: Fixed (was: Untriaged)
Looks like this outage was resolved source-side.
Cc: katthomas@chromium.org
Components: Infra>Platform>Swarming Infra>Monitoring
Labels: -Infra-Troopers
Owner: dpranke@chromium.org
Status: Assigned (was: Fixed)
I'm going to reopen this a bit to do some follow-up, but we can take this out of the trooper queue.

It sounds like this actually went reasonably well in terms of handling a src-induced overload, and so I think it might be interesting to look at it for places to improve.

nodir@ - how did you notice this? Did we get step failures alerts, or did someone ping you, or something else?

Regarding the comments in #5-#6, maruel@, are the queries taking too long expected behavior? Should we be able to shed load better so that queries complete in a timely manner?

Did we get alerts on pending times for tasks being too long? If not, do we need to adjust thresholds there? 

Do we know what the problem with the offending CL was? Was it hanging the task?

Do we know how quickly we ran out of capacity, or what the performance characteristics are like to know how quickly we should've recovered and how well we did?
Summary: investigate tests on swarming timeout (was: tests on swarming timeout)
Oh, also, were we planning to write a postmortem for this?

Comment 20 by no...@chromium.org, Mar 20 2017

Cc: sergeybe...@chromium.org
I received pages for a high infra-failure rate in some builders; see tickets in o/369081. Then I received an alert for a high rate of HTTP 500s on Swarming (which was caused by the long pending queue, caused by the same thing).

There were no alerts from the metrics mentioned in #7. I am not sure such alerts would be useful, because those metrics are precomputations with a very large window (2h). I think stip@ set these metrics up, but bpastene@ would know better.

I didn't dig into the Chromium side of this outage, e.g. what caused the timeout, whether it was a slowdown or a hang, etc. I don't think I have enough domain-specific knowledge.

> Do we know how quickly we ran out of capacity, or what the performance characteristics are like to know how quickly we should've recovered and how well we did?
Here is a graph of pending tasks: http://shortn/_YUk933miHf
I think "active jobs" is close to "number of bots".
At ~1pm the pending queue started to grow.
At ~2pm it was so long that tasks started to expire.
I interpret the graph as showing that the service got back to normal at ~6pm, but maruel@ or sergeyberezin@ could tell better.

---

I think it would be useful to set up a few more swarming alerts:
- # of pending tasks, i.e. http://shortn/_feQCpt0Mk8
- rate of task expiration
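The two proposed alerts could be expressed as simple threshold checks. The sketch below is purely illustrative (it is not actual Chrome-Infra monitoring config; the alert names and threshold values are assumptions):

```python
# Hypothetical alerting sketch, not real monitoring config: threshold
# checks for the two proposed Swarming alerts.

PENDING_TASKS_THRESHOLD = 500     # assumed value; would be tuned per pool
EXPIRATION_RATE_THRESHOLD = 0.05  # assumed: alert if >5% of tasks expire

def swarming_alerts(pending_tasks, tasks_expired, tasks_submitted):
    """Return the names of alerts that fire for one monitoring sample."""
    alerts = []
    # Alert 1: pending queue too long (would have fired around ~1pm,
    # before tasks started expiring at ~2pm).
    if pending_tasks > PENDING_TASKS_THRESHOLD:
        alerts.append('swarming-pending-queue-too-long')
    # Alert 2: too large a fraction of submitted tasks expiring unrun.
    if tasks_submitted and tasks_expired / tasks_submitted > EXPIRATION_RATE_THRESHOLD:
        alerts.append('swarming-task-expiration-rate-high')
    return alerts
```

The point of alerting on queue length rather than only on expirations is lead time: in this outage the queue started growing roughly an hour before tasks began expiring.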

Comment 21 by no...@chromium.org, Mar 20 2017

I didn't plan to write a postmortem.

From my perspective the outcome of this postmortem is:
- alerts on InfraFailures worked
- we need more alerts on swarming to detect the problem even earlier; # of pending tasks would catch it early
- it would be great if Swarming had an "EMERGENCY: add more bots" button (like the "emergency add quota" button in App Engine), but I think that is possible only for Linux (thanks to MachineProvider), not for Macs/Windows.
The long pending queue leading to sluggish performance is being worked on. That's my main project at the moment.
Owner: no...@chromium.org
Status: Fixed (was: Assigned)
