Investigate why triggering jobs on swarming sometimes takes 30+ seconds
Issue description: While investigating runs after deploying the webkit layout tests on swarming, we discovered that triggering a job on swarming sometimes takes a long time. Normally it completes within a couple of seconds (usually in under 100 ms!), but sometimes it can take 30 seconds or more. We don't know how frequently this occurs, nor why it is occurring. We need to do some querying to figure it out.
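One way to start on the "how frequently" question is sketched below. It assumes the request logs have been exported as simple (path, latency) records; the `RequestLog` shape, the `slow_fraction` helper and the sample values are all hypothetical illustrations, not the real App Engine log schema or real measurements.

```python
from typing import Iterable, NamedTuple


class RequestLog(NamedTuple):
    # Hypothetical record shape; not the real App Engine log schema.
    path: str
    latency_s: float


def slow_fraction(logs: Iterable[RequestLog], path: str = "/tasks/new",
                  threshold_s: float = 30.0) -> float:
    """Return the fraction of requests to `path` taking at least `threshold_s`."""
    latencies = [r.latency_s for r in logs if r.path == path]
    if not latencies:
        return 0.0
    slow = sum(1 for s in latencies if s >= threshold_s)
    return slow / len(latencies)


# Made-up sample data, for illustration only.
sample = [
    RequestLog("/tasks/new", 0.08),
    RequestLog("/tasks/new", 1.9),
    RequestLog("/tasks/new", 31.2),
    RequestLog("/tasks/new", 0.4),
    RequestLog("/bot/poll", 45.0),  # different handler, ignored
]
print(slow_fraction(sample))  # 1 of the 4 matching requests is slow
```

Running the same calculation per hour or per instance would show whether the slow requests cluster around load spikes or are spread evenly.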
Apr 24 2017
No need to set generic Infra components when it has a clear subcomponent already.
Apr 24 2017
I've looked at recent logs of slow /tasks/new requests, and they all have unexplainable stalls, like this one: https://screenshot.googleplex.com/bzGNOfw9UQR.png (a request comes in, does nothing for 30 sec, then hits the log line at the very start of the handler, before any expensive operations; what did it do for 30 sec?). This looks very similar to https://bugs.chromium.org/p/chromium/issues/detail?id=704907#c38. I hope the work Marc-Antoine is doing in https://bugs.chromium.org/p/chromium/issues/detail?id=704236 will reduce the load on our GAE instances, and these random stalls will disappear.
Apr 28 2017
Could you please investigate this further? Random 30-second stalls seem like a *bad* thing and shouldn't be happening no matter the load. If you think it is a problem in AppEngine, we should talk to AppEngine SRE and see if they can help us diagnose the problem.
Jun 28 2017
This was before M-A's scheduler optimizations; please reopen if it's still happening.
Jun 29 2017
estaab@ I don't think this has anything to do with the scheduler? vadimsh@ said "recent logs of slow /tasks/new, and they all have unexplainable stalls". This feels like an issue in request handling, or something configured wrong on AppEngine? I really think this still needs investigating.
Jun 29 2017
The swarming scheduler (before M-A's optimizations) was a huge CPU hog. GAE instances were CPU-throttled, and that caused all requests to be slow. Spot-checking /tasks/new logs now shows they are consistently <2 s (usually under 500 ms). Monarch metrics seem to agree: http://shortn/_s6kYemX4Zv (though they tend to lie, since they don't include time spent in the GAE routing guts, which are affected by CPU overloads too).
Jun 29 2018
Issue has not been modified or commented on in the last 365 days, please re-open or file a new bug if this is still an issue. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
Comment 1 by tansell@chromium.org, Apr 24 2017
Components: Infra