Issue 905012

Starred by 2 users

Issue metadata

Status: Duplicate
Merged: issue 899991
Owner: ----
Closed: Nov 23
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 2
Type: Bug



Mac webkit layout tests appeared stuck while running tests for a CL

Project Member Reported by c...@chromium.org, Nov 13

Issue description

My CL at https://crrev.com/c/1316830 was marked CQ+2 at 11:59a. The Mac Rel build started at 12:00p and is still running.

(at 1:23p):
https://ci.chromium.org/p/chromium/builders/luci.chromium.try/mac_chromium_rel_ng/184483 shows webkit_layout_tests running, 25 mins elapsed

stdout: https://logs.chromium.org/logs/chromium/buildbucket/cr-buildbucket.appspot.com/8929943545456525328/+/steps/webkit_layout_tests_on_Intel_GPU_on_Mac__with_patch__on_Mac-10.12.6/0/stdout
======
...
13:00:13.979 59056     editing/undo/undo-smart-delete-word.html
13:00:13.979 59056     external/wpt/css/css-writing-modes/sizing-orthog-vlr-in-htb-020.xht
13:00:13.979 59056     external/wpt/css/css-writing-modes/sizing-orthog-vrl-in-htb-020.xht
13:00:13.979 59056     fast/events/frame-detached-in-mousedown.html
13:00:13.979 59056     fast/forms/select/menulist-appearance-rtl.html
13:00:13.979 59056     fast/text/drawBidiText.html
13:00:13.979 59056     virtual/layout_ng/fast/block/float/overhanging-tall-block.html
13:00:13.979 59056     virtual/layout_ng/fast/inline/inline-offsetLeft-continuation.html
13:00:13.979 59056     virtual/new-remote-playback-pipeline/media/controls/buttons-after-reset.html
13:00:13.979 59056     virtual/outofblink-cors-ns/http/tests/security/contentSecurityPolicy/object-src-does-not-affect-child.html
13:00:13.983 59056 
13:00:13.983 59056 Testing completed. Exit status: 0
+------------------------------------------------------------------------+
| End of shard 8                                                         |
|  Pending: 532.9s  Duration: 896.3s  Bot: build560-m4  Exit: 0          |
+------------------------------------------------------------------------+

Waiting for results from the following shards: 2, 5, 7, 9
======

From what I can see, 4 shards are stuck and the bot hasn't done anything for 23+ mins.

What caused this bot to be stuck?
 
Labels: -Pri-1 Pri-2
Summary: Mac webkit layout tests appeared stuck while running tests for a CL (was: Mac webkit layout tests are stuck while CQ+2 on my CL)
The bot ended up completing at 1:49p, 26 mins after I looked at the job.

1:23p I filed this bug, no output was in the logs
1:28p Shard 9 ended, showing a big text dump including timestamps 1:09p-1:28p
1:31p Shard 7 ended
1:32p Shard 2 ended
1:42p Shard 5 ended

Based on this, I'm updating the summary. The bot wasn't stuck; it only appeared stuck.
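
The timeline above can be double-checked with a little arithmetic. This is only a minimal sketch: it assumes every time is on the same day, that the "a"/"p" suffixes mean AM/PM, and the helper names (`parse_clock`, `minutes_between`) are made up for illustration.

```python
from datetime import datetime


def parse_clock(text):
    """Parse a time written like '1:23p' or '11:59a' (hypothetical helper)."""
    # Appending "m" turns '1:23p' into '1:23pm', which %I:%M%p understands.
    return datetime.strptime(text + "m", "%I:%M%p")


def minutes_between(start, end):
    """Whole minutes from `start` to `end`, assuming both are on the same day."""
    delta = parse_clock(end) - parse_clock(start)
    return int(delta.total_seconds() // 60)


BUILD_START = "12:00p"  # from the issue description
EVENTS = [
    ("bug filed", "1:23p"),
    ("shard 9 ended", "1:28p"),
    ("shard 7 ended", "1:31p"),
    ("shard 2 ended", "1:32p"),
    ("shard 5 ended", "1:42p"),
    ("build completed", "1:49p"),
]

for label, when in EVENTS:
    offset = minutes_between(BUILD_START, when)
    print("%-16s %s  (+%d min from build start)" % (label, when, offset))
```

With these inputs, `minutes_between("1:23p", "1:49p")` gives the 26 mins quoted above, and the gap between the last log line (13:00) and the 1:23p check gives the "23+ mins" of apparent silence.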

We should fix the test output so it doesn't give the appearance that the job is stuck.
Components: -Infra Infra>Client>Chrome
Components: -Infra>Client>Chrome Infra>Platform>Swarming
Labels: Foundation-Troopers
All of the long shards (9, 7, 5, 2) have ~30 minute overheads. This is a semi-known issue, tracked in bug 899991. I'm trying out a fix which could help.

I don't know how feasible it'd be to have it print out a periodic message every 5 minutes or so when collecting tasks. Swarming people would know more here.
Cc: jpwilson@google.com
Status: Available (was: Untriaged)
+Jon as this is the kind of thing impacting the CI's runtime (thus reducing overall fleet throughput) that we'll have to monitor more closely and fix.

Airborne today so can't take a look now.
Re. adjusting output so the job doesn't look stuck, it looks like task updates are supposed to be emitted every 15 min. [1]

Hypothesis: the first "Waiting for results" came in at 1:15. The second would have come in at 1:30, but was interrupted by Shard 9's finished output coming in at 1:28.

I've put together a CL to increase the frequency and also output the update time, which should help us see better if/when this happens in the future. [2]

[1] https://cs.chromium.org/chromium/infra/luci/client/swarming.py?l=654&rcl=473a850bc451ce86db312d3c209a30ccffee832b
[2] https://chromium-review.googlesource.com/c/infra/luci/luci-py/+/1340660
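
To illustrate the idea of a more frequent, timestamped update, here is a minimal sketch of a collection loop. It is not the code in swarming.py: `wait_for_shards`, `poll`, and every parameter are hypothetical. The 5-minute default follows the interval suggested earlier in this thread, and restarting the quiet-period timer whenever a shard reports is an assumption that mirrors the hypothesis above.

```python
import time
from datetime import datetime


def wait_for_shards(pending, poll, update_interval_s=300.0, poll_interval_s=1.0,
                    clock=time.monotonic, sleep=time.sleep, emit=print):
    """Poll until every shard in `pending` has finished.

    `poll` is a hypothetical callable returning the set of shard indices
    that finished since the previous call. A timestamped status line is
    emitted whenever `update_interval_s` seconds pass without any output.
    """
    pending = set(pending)
    last_output = clock()
    while pending:
        stamp = datetime.now().strftime("%H:%M:%S")
        finished = poll() & pending
        for shard in sorted(finished):
            emit("%s Shard %d finished" % (stamp, shard))
        if finished:
            pending -= finished
            last_output = clock()  # real output restarts the quiet-period timer
        elif clock() - last_output >= update_interval_s:
            waiting = ", ".join(str(s) for s in sorted(pending))
            emit("%s Waiting for results from the following shards: %s"
                 % (stamp, waiting))
            last_output = clock()
        if pending:
            sleep(poll_interval_s)
```

Because the clock, sleep, and output functions are injectable, the loop can be exercised with a fake clock instead of waiting in real time.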
Project Member

Comment 6 by bugdroid1@chromium.org, Nov 19

The following revision refers to this bug:
  https://chromium.googlesource.com/infra/luci/luci-py.git/+/fa7445abcb3a53e2bddb60a2656582c5b34fcd8e

commit fa7445abcb3a53e2bddb60a2656582c5b34fcd8e
Author: Jao-ke Chin-Lee <jchinlee@chromium.org>
Date: Mon Nov 19 17:04:41 2018

[client] Print time with tasks update message. Also increase frequency.

BUG= 905012 

Change-Id: Id8349367695baaf204ce7c2489be44f32dd8fffb
Reviewed-on: https://chromium-review.googlesource.com/c/1340660
Commit-Queue: Jao-ke Chin-Lee <jchinlee@chromium.org>
Reviewed-by: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/fa7445abcb3a53e2bddb60a2656582c5b34fcd8e/client/swarming.py

Project Member

Comment 7 by bugdroid1@chromium.org, Nov 20

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/6472f2d72226f5f0776ad5301840b98012c5bb53

commit 6472f2d72226f5f0776ad5301840b98012c5bb53
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Tue Nov 20 23:16:48 2018

Roll src/tools/swarming_client/ 7f463e66e..b6e9e23e4 (4 commits)

https://chromium.googlesource.com/infra/luci/client-py.git/+log/7f463e66e1c4..b6e9e23e4e79

$ git log 7f463e66e..b6e9e23e4 --date=short --no-merges --format='%ad %ae %s'
2018-11-20 maruel [client]: fix undefined class reference
2018-11-19 vadimsh [proto] Fix google/rpc/*_pb2.py, it has wrong proto paths in it.
2018-11-19 maruel protobuf: upgrade to 3.6.1 from 3.5.1
2018-11-19 jchinlee [client] Print time with tasks update message. Also increase frequency.

Created with:
  roll-dep src/tools/swarming_client

R=jchinlee@chromium.org

Bug:  905012 
Change-Id: I5e397d787e27223d94dd85b35d3c2826dc93febe
Reviewed-on: https://chromium-review.googlesource.com/c/1343666
Reviewed-by: Jao-ke Chin-Lee <jchinlee@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>
Cr-Commit-Position: refs/heads/master@{#609845}
[modify] https://crrev.com/6472f2d72226f5f0776ad5301840b98012c5bb53/DEPS

Mergedinto: 899991
Status: Duplicate (was: Available)
This is essentially the same issue as issue 899991.

This is about *download* overhead, not about upload. Download overhead is influenced by how smart the archiver is, which is issue 854610. That said, as I noted in issue 899991, I suspect these VMs will gain a lot of performance by being redeployed.
