Issue 714529

Issue metadata

Status: Archived
Owner: ----
Closed: Jun 2018
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: ----
Type: ----



Investigate why triggering jobs on swarming sometimes takes 30 seconds +

Project Member Reported by tansell@chromium.org, Apr 24 2017

Issue description

While investigating runs after deploying the WebKit layout tests on Swarming, we discovered that triggering a job on Swarming sometimes takes a long time. Normally it completes within a couple of seconds (often in under 100 ms!), but sometimes it takes 30 seconds or more.

We don't know how frequently this occurs, nor why it is occurring. We need to do some querying to figure it out.
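
A minimal sketch of the kind of query meant here, assuming the trigger latencies have already been exported from the request logs as a list of seconds. The function name `slow_fraction` and the 30 second default threshold are illustrative assumptions, not part of Swarming:

```python
from typing import Iterable


def slow_fraction(latencies_s: Iterable[float], threshold_s: float = 30.0) -> float:
    """Return the fraction of samples that took at least threshold_s seconds.

    latencies_s is assumed to hold one trigger latency per request, in seconds.
    """
    samples = list(latencies_s)
    if not samples:
        # No data: report 0.0 rather than dividing by zero.
        return 0.0
    slow = sum(1 for s in samples if s >= threshold_s)
    return slow / len(samples)


if __name__ == "__main__":
    # Hypothetical sample: three fast triggers and one 31 second outlier.
    print(slow_fraction([0.1, 0.08, 31.0, 2.0]))
```

Running the same calculation per day or per hour would show whether the slow triggers cluster around periods of high load.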
 
Cc: mar...@chromium.org
Components: Infra
Components: -Infra
No need to set generic Infra components when it has a clear subcomponent already.
I've looked at recent logs of slow /tasks/new requests, and they all have unexplainable stalls, like this one: https://screenshot.googleplex.com/bzGNOfw9UQR.png (a request comes in, does nothing for 30 sec, then hits the log line at the very start of the handler, before any expensive operations; what did it do for 30 sec?).

This looks very similar to https://bugs.chromium.org/p/chromium/issues/detail?id=704907#c38

I hope the work Marc-Antoine does in https://bugs.chromium.org/p/chromium/issues/detail?id=704236 will reduce load on our GAE instances, and these random stalls will disappear.
Could you please investigate this further? Random 30 second stalls seem like a *bad* thing and shouldn't be happening no matter the load.
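
One way to quantify such stalls, sketched under the assumption that each request's log exposes the arrival time and the timestamp of the first handler log line. The `RequestLog` record, its field names, and the 10 second cutoff are hypothetical and do not reflect the actual GAE log schema:

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class RequestLog:
    request_id: str      # hypothetical identifier for the request
    arrived_s: float     # time the request reached the instance, in seconds
    first_log_s: float   # time of the first log line emitted by the handler


def find_stalls(records: Iterable[RequestLog],
                min_gap_s: float = 10.0) -> List[Tuple[str, float]]:
    """Return (request_id, gap) for requests that idled before the handler ran."""
    stalls = []
    for r in records:
        gap = r.first_log_s - r.arrived_s
        if gap >= min_gap_s:
            stalls.append((r.request_id, gap))
    return stalls


if __name__ == "__main__":
    logs = [
        RequestLog("a", 0.0, 0.05),      # handler started almost immediately
        RequestLog("b", 100.0, 130.5),   # about 30 seconds of unexplained idle time
    ]
    print(find_stalls(logs))
```

Separating the pre-handler gap from the handler's own run time would help tell a routing or instance-level stall apart from slow application code.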

If you think it is a problem in AppEngine we should talk to AppEngine SRE and see if they can help us diagnose the problem.

Comment 5 by estaab@chromium.org, Jun 28 2017

Status: WontFix (was: Untriaged)
This was before M-A's scheduler optimizations; please reopen if it's still happening.
Cc: vadimsh@chromium.org
Status: Unconfirmed (was: WontFix)
estaab@ I don't think this has anything to do with the scheduler? 

vadimsh@ said "recent logs of slow /tasks/new and they all have unexplainable stalls". This feels like an issue in request handling, or something misconfigured on AppEngine?

I really think this still needs investigating.
The Swarming scheduler (before M-A's optimizations) was a huge CPU hog. GAE instances were CPU-throttled, and that caused all requests to be slow.

Spot checking /tasks/new logs now shows they are consistently <2s (usually under 500 ms). Monarch metrics seem to agree: http://shortn/_s6kYemX4Zv (though they tend to lie, since they don't include time spent in GAE routing guts, which is affected by CPU overload too).
Comment 8 by sheriffbot@chromium.org, Jun 29 2018

Status: Archived (was: Unconfirmed)
Issue has not been modified or commented on in the last 365 days, please re-open or file a new bug if this is still an issue.

For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot