Timeout errors: webkit_layout_tests failing on multiple builders
Issue description
Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of hajimehoshi@google.com

webkit_layout_tests failing on multiple builders

Builders failed on:
- Mac10.12 Tests: https://build.chromium.org/p/chromium.mac/builders/Mac10.12%20Tests
- WebKit Linux Trusty (dbg): https://build.chromium.org/p/chromium.webkit/builders/WebKit%20Linux%20Trusty%20%28dbg%29
,
Dec 12 2017
Interesting, on the Mac builder it kinda looks like the swarming client is hanging or timing out before the task even gets started:

https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/8156
https://chromium-swarm.appspot.com/task?id=3a623607d887df10&refresh=10&request_detail=true&show_raw=1

https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/8155
https://chromium-swarm.appspot.com/task?id=3a62119085ee5e10&refresh=10&show_raw=1

https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/8153
https://chromium-swarm.appspot.com/task?id=3a61cab15355ab10&refresh=10&show_raw=1

Marc-Antoine, is that what it looks like to you? The Linux failures look like normal test failures; I don't see any timeouts.
,
Dec 12 2017
Issue 794277 has been merged into this issue.
,
Dec 12 2017
Sometimes there is output from the bot but it didn't complete:
https://chromium-swarm.appspot.com/task?id=3a650f7115ddba10&refresh=10&show_raw=1

Sometimes there is no output in the logs:
https://chromium-swarm.appspot.com/task?id=3a6533b9812ae310&refresh=10&show_raw=1
,
Dec 12 2017
It's an I/O timeout. This means the task didn't print anything out for more than 20 minutes (the default value).
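For illustration, the io_timeout mechanism boils down to a watchdog that resets on every byte of task output and kills the task after a fixed period of silence. Below is a minimal sketch of that idea, assuming a POSIX host; the function and constant names are made up and this is not the actual run_isolated implementation:

import select
import subprocess
import sys

IO_TIMEOUT = 20 * 60  # default: 20 minutes of silence before the task is killed


def run_with_io_timeout(cmd, io_timeout=IO_TIMEOUT):
    """Runs cmd and kills it if it produces no output for io_timeout seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    while True:
        # Wait for output, but never longer than the I/O budget.
        ready, _, _ = select.select([proc.stdout], [], [], io_timeout)
        if not ready:
            proc.kill()
            raise RuntimeError('I/O timeout: no output for %ds' % io_timeout)
        chunk = proc.stdout.read1(4096)  # any output resets the I/O timer
        if chunk:
            sys.stdout.buffer.write(chunk)
            sys.stdout.buffer.flush()
        elif proc.poll() is not None:
            return proc.returncode


if __name__ == '__main__':
    sys.exit(run_with_io_timeout(sys.argv[1:]))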
,
Dec 15 2017
Fadi observed this on mac_chromium_rel_ng as well. Example:
https://ci.chromium.org/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/612406
Timed-out shard:
https://chromium-swarm.appspot.com/task?id=3a6eccbfc8b41310

Other examples on that builder:
https://ci.chromium.org/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/612323
https://ci.chromium.org/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/612611

One hypothesis here was that the "899 files remaining..." part takes up to 20 minutes. In any case, the task start date is a long time before the first log lines from run-webkit-tests. The "N files remaining" lines are logged from the fetch_isolated function in isolateserver.py... I remember layout tests were a bit of an outlier in how many little files they required.

Would it be helpful to add timestamps to the logs from fetch_isolated? Any other possible causes?
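If it helps with the timestamp question: when those progress lines go through Python's logging module, timestamps can be added purely via the formatter. A hedged sketch of what that could look like (the real logging setup in isolateserver.py may be wired differently):

import logging

# Prefix every record with a wall-clock timestamp so gaps between the
# "N files remaining..." messages become visible in the task output.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s.%(msecs)03d %(levelname)s %(message)s',
    datefmt='%H:%M:%S')

logging.info('899 files remaining...')
logging.info('400 files remaining...')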
,
Dec 20 2017
The hypothesis is incorrect. The task duration as listed excludes the overhead. The I/O timeout happens precisely at the end of the output, not before, because, well, there was output, which resets the I/O timer. So whatever happens after "09:24:37.845 1742 worker/0 virtual/exotic-color-space/images/pixel-crack-image-background-webkit-transform-scale.html passed" is the problem. You'll have to prove to me that it is an infrastructure problem before I investigate. I recommend running the step manually via "Debug a task".
,
Dec 20 2017
* I meant: you'll have to prove to me that the task is not hanging, e.g. that user code is not at fault. Have you looked at the webkit test that normally runs right after the last one in the stdout of the ones that hang?
,
Dec 20 2017
+1 to maruel@; removing from the Troopers queue and moving to the Infra>Platform>Swarming component.
,
Dec 20 2017
M-A, re #7: wouldn't we expect 20 minutes of silence following that last 09:24:37 line in that case? (i.e., the task completion time would be something like 12:44 EST rather than 12:24:38 EST / 09:24:38 PST)?
,
Dec 21 2017
(Reminder that times are reported in the Web UI in your time zone.) That seems like a good gotcha, John. In particular, I just realized the additional minute is the grace period for SIGTERM: one for the process, one for run_isolated uploading results back. Not sure why run_isolated would take 30 seconds to clean up; filed issue 796754 to help with this.

Spot checking 5 tasks from:
https://chromium-swarm.appspot.com/tasklist?f=name%3Awebkit_layout_tests&f=os%3AMac-10.12.6&f=state%3ACOMPLETED
The overhead is fairly consistently around 1m20s and they all report around "400 files remaining...". In this task, it was 899 files. It still doesn't make sense to take very close to 20 minutes for that, and the timing information doesn't line up. Will reassign to myself to take a deeper look.

Sorry for the harsh wording in comment #7.
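For readers unfamiliar with the grace-period behaviour mentioned above, the usual shape of it is: send SIGTERM, give the process a fixed window to exit cleanly, then fall back to SIGKILL. A minimal sketch of that pattern (not the actual run_isolated code; the 60-second value is only the figure discussed in this comment):

import subprocess

GRACE_PERIOD = 60  # seconds the process gets to exit cleanly after SIGTERM


def terminate_gracefully(proc, grace=GRACE_PERIOD):
    """Sends SIGTERM, waits up to `grace` seconds, then falls back to SIGKILL."""
    proc.terminate()  # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL
        return proc.wait()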
,
Jan 3 2018
Taking out of sheriff queue as an owner is assigned.
,
Jan 8 2018
Issue 799266 has been merged into this issue.
,
Jan 8 2018
Issue 718336 has been merged into this issue.
,
Jan 24 2018
Issue 787680 has been merged into this issue.
,
Jan 24 2018
Hit this a few times today on https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/, at least.
,
Jan 24 2018
Bumping priority as this appears to be occurring quite frequently now.
,
Feb 15 2018
Sheriff today. Still very frequently occurring.
,
Feb 16 2018
So what's the story here? Any plan or insights? This bug is causing a lot of work for sheriffs and affects all Chrome engineers' productivity (as this also happens on try bots). I know we're all busy with our own work, but it seems important to get this prioritized. Can we escalate this (in a positive way, to make sure people get the time needed to work on it)? [Adding today's sheriffs so they know about the issue.]
,
Feb 21 2018
Adding Sheriff-Chromium to get the attention of the current sheriffs.
,
Feb 21 2018
The timeout can be increased in the JSON file via the 'io_timeout' value, which defaults to 1200. There are already many examples in the sources; e.g.:
https://cs.chromium.org/chromium/src/testing/buildbot/chromium.perf.fyi.json?l=28
I will do a quick pass to increase the value for all 'webkit_layout_tests' tests to 2400. Issue 796757 is the permanent fix.
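For reference, the per-suite entries in the testing/buildbot JSON files look roughly like the snippet below; the exact surrounding keys are from memory and may differ, but 'io_timeout' (in seconds) is the value being bumped here:

{
  "test": "webkit_layout_tests",
  "swarming": {
    "can_use_on_swarming_builders": true,
    "io_timeout": 2400
  }
}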
,
Feb 21 2018
Thanks for the investigation! Is it just me, or does it sound a bit crazy to deal with timeouts of 20 or 40 minutes? Any idea what's taking so long and how that could be sped up? [This is independent of the immediate fix suggested in comment 26 -- which sounds like the right next step.]
,
Feb 22 2018
It's not just you. There's no reason we should need I/O timeouts of more than a couple minutes for this test. Increasing the timeouts might help us see if something changes, but it shouldn't be considered a fix.
,
Feb 23 2018
Based on my investigations when I faced this issue, most of the jobs don't even start within the 20 minutes; only a few of them do. My worry about increasing the timeout (as suggested in #26) is that it would take longer until we are notified of the failure.
,
Mar 6 2018
Issue 818207 has been merged into this issue.
,
Mar 8 2018
27 out of the last 200 mac_chromium_rel_ng builds failed webkit_layout_tests. Can something be done, like not running webkit_layout_tests on mac_chromium_rel_ng until this is fixed, or marking mac_chromium_rel_ng experimental?
,
Mar 9 2018
#32: dropping the suite would be vastly preferable to dropping the bot.
,
Mar 9 2018
Issue 818207, which was about multiple occurrences of webkit_layout_tests timing out on a Swarming shard with empty output, has been merged into this issue. maruel@: do we believe those might have the same cause as stalling at "N files remaining"?
,
Mar 15 2018
Can something be done about this? It's failing all over the place.
,
Mar 17 2018
webkit_layout_tests is still intermittently timing out on the Mac10.12 Tests builder - https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/. WebKit Linux Trusty (dbg) seems to be fine. Any progress on this? This is making the CQ slower, we should aim for a quick resolution.
,
Mar 20 2018
Issue 822982 has been merged into this issue.
,
Mar 20 2018
M-A: is there anything we can do to help with this?
,
Mar 20 2018
A recent example of timeout:
Build: https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/11412
Shard: https://chromium-swarm.appspot.com/task?id=3c5d050c5216d510&refresh=10&show_raw=1

Again, empty output, which makes me clueless... I would love to help from the Blink/webkitpy side, but don't really know where to start without any logs. Any ideas, Swarming folks?
,
Mar 20 2018
I've found several bots that have timed out and do have some output. An example:
https://chromium-swarm.appspot.com/task?id=3c5bf9dc6017ad10&refresh=10&show_raw=1
Output seems to stop after "Parsing expectations..."

Compare to a bot that has completed:
https://chromium-swarm.appspot.com/task?id=3c5cc3742cd00510&refresh=10&show_raw=1
,
Mar 20 2018
Hmm, parsing test expectations usually takes <1 min. It's not optimal, but not a known bottleneck either (and I can't reproduce the stall...).

Is there a more programmatic way to drill into all these timeouts and find out:
1. the percentage of timeouts on various platforms
2. the stalling points (last couple lines of logs, if any)
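One possible way to pull that data, sketched below against the Swarming REST API that backs the task-list UI; the endpoint paths, query parameters, and auth handling here are assumptions and would need to be checked against the real API before use:

import json
import urllib.request

SWARMING = 'https://chromium-swarm.appspot.com/_ah/api/swarming/v1'


def get_json(path, token):
    # Assumption: the API accepts an OAuth bearer token and returns JSON.
    req = urllib.request.Request(
        SWARMING + path, headers={'Authorization': 'Bearer ' + token})
    return json.load(urllib.request.urlopen(req))


def summarize_timeouts(token):
    """Prints the bot and the last few output lines of recent TIMED_OUT shards."""
    tasks = get_json(
        '/tasks/list?state=TIMED_OUT&tags=name%3Awebkit_layout_tests&limit=100',
        token)
    for item in tasks.get('items', []):
        stdout = get_json('/task/%s/stdout' % item['task_id'], token)
        tail = (stdout.get('output') or '').splitlines()[-3:]
        print(item['task_id'], item.get('bot_id'), '|', ' / '.join(tail))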
,
Mar 21 2018
Handy link to see all the Mac bots that have failed with a timeout; there were 142 failures in the last 24 hours:
https://chromium-swarm.appspot.com/tasklist?c=name&c=state&c=created_ts&c=duration&c=pending_time&c=pool&c=bot&et=1521612960000&f=state%3ATIMED_OUT&f=gpu%3A8086%3A0a2e&f=hidpi%3A0&f=os%3AMac-10.12.6&f=cpu%3Ax86-64&f=pool%3AChrome&l=50&n=true&s=created_ts%3Adesc&st=1521526440000

There must be a way to get some output off the servers that are timing out.
,
Mar 21 2018
John, I suspect the archival is not creating good isolated files, so fetching is bad. But that's still not normal: the files should be relatively hot, so something is happening that makes the bot's local isolated cache cold, which causes a lot of thrashing. So either the tar files are not deterministic anymore, or the bot trashes its local cache.
,
Mar 26 2018
Just a note to say this still seems to be causing many timeouts and keeping things slightly more interesting than normal for sheriffs :)
,
Mar 28 2018
Is this timeout caused by this issue?
https://chromium-swarm.appspot.com/task?id=3c82b18765ca5d10&refresh=10&show_raw=1
from:
https://ci.chromium.org/buildbot/tryserver.chromium.mac/mac_chromium_rel_ng/681294
,
Mar 29 2018
What log level is isolateserver.py running at? From looking at the code, it seems like isolateserver.py sets up a DeadlockDetector which would trigger after 5 minutes. We then go into a loop and download files. From my reading, the DeadlockDetector outputs via logger.warning. If our log level is above that, we won't see that a deadlock happened, so if the for loop in isolateserver processes 4 files that each trigger the deadlock-detector warning, we wouldn't see any of them and we'd time out.

Lowering the log level doesn't fix the problem, but it would at least let us know whether we are hitting the 5-minute timeouts downloading chunks of data. This would also explain the empty log output, as this code would be the first thing to log any data.
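The suppression scenario described above is easy to demonstrate with plain Python logging (this is generic logging behaviour, not the isolateserver code itself): if the effective level is above WARNING, the DeadlockDetector-style messages never reach the output, which would look exactly like an empty log.

import logging

logging.basicConfig(level=logging.ERROR)  # anything below ERROR is dropped

# A DeadlockDetector-style message: at ERROR level it is silently discarded,
# so the task output stays empty even though the detector fired.
logging.warning('Potential deadlock: no progress for 5 minutes')

logging.getLogger().setLevel(logging.WARNING)
logging.warning('With the level lowered, the warning is visible')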
,
Mar 29 2018
I'm not sure if I'm reading this right, but from reading the code: in isolateserver.py, we call `wait` on the FetchQueue object, which calls `pull` on the channel. The channel's `pull` will wait forever for an item to be put in the queue (we pass timeout None, which triggers a 30-second timeout on the queue, but we then just loop again and keep waiting).

On the fetch side, in ThreadPool's `_run`, it looks like if an exception is raised, we drop the task. In that case, nothing ever gets written into the queue for the `pull` above, which would leave the request stuck. It looks like we only log warning- and info-level messages if the `_run` method gets an exception.

So, is it possible we fail to fetch a resource from storage and just get wedged waiting for the response to be put in the queue?
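In miniature, the hang described above looks like the sketch below: the consumer keeps polling a queue that nothing will ever fill because the producer thread died with an unhandled exception. This is only an illustration of the failure mode, not the actual FetchQueue/ThreadPool code:

import queue
import threading

results = queue.Queue()


def fetch_worker():
    # Simulates a fetch that raises; the exception dies with the thread and
    # nothing is ever put on the results queue.
    raise IOError('failed to fetch a resource from storage')


threading.Thread(target=fetch_worker, daemon=True).start()

# Mirrors the pull loop: wait and retry. With no producer left, this is the
# wedged state; the demo gives up after a few attempts instead of spinning forever.
for attempt in range(3):
    try:
        print('got', results.get(timeout=3))  # the real loop waits 30s per iteration
        break
    except queue.Empty:
        print('still waiting (attempt %d)...' % (attempt + 1))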
,
Apr 2 2018
Issue 828143 has been merged into this issue.
,
Apr 4 2018
It looks like ~20% of these tasks are still timing out, which is presumably causing a significant spike in the load and lots of flakiness. I think we need to take this out of the CQ until we can resolve this problem. (I'd be open to making this an experimental task if we think we have the capacity, though). Escalating to P0; maruel, which thing do you think we should do here?
,
Apr 4 2018
Re #53: Some of the timeouts (especially those on Mac) might be genuine timeouts due to the increasing number of tests, and could be fixed by issue 828543. Do we have a breakdown of how many tasks time out with no output?
,
Apr 4 2018
Good question. I don't know how easy that would be to determine (you can't easily get that from the swarming_tasks table; you'd have to do some much more complicated queries against the test_results). However, it looks like the average duration of successfully completed tasks is close to 20 min, so you might be right. And that's way too slow, so we need to add more shards as per bug 828543 regardless.
,
Apr 4 2018
I'm not sure why the "xxx files remaining..." is not printed out anymore.
,
Apr 4 2018
The following revision refers to this bug:
https://chromium.googlesource.com/infra/luci/luci-py.git/+/3ee1cf02f587cb59951328240b85d81e56fe1001

commit 3ee1cf02f587cb59951328240b85d81e56fe1001
Author: Marc-Antoine Ruel <maruel@chromium.org>
Date: Wed Apr 04 23:46:01 2018

[swarming] flush the 'xxx files remaining...' message to stdout

It looks like it didn't flush properly on macOS, try to manually flush. This is merely an hack, it doesn't fix the underlying problem (huge overhead) but at least it should unblock users.

Background: I've seen tasks TIMED_OUT on macOS due to 20 minutes io_timeout. The logs clearly show the 'xxx files remaining...' but they are not in the task's output.

R=vadimsh@chromium.org
Bug: 794085
Change-Id: Iea74019fccd9b2cf30dfcb1554a791778e283a3f
Reviewed-on: https://chromium-review.googlesource.com/996693
Reviewed-by: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Vadim Shtayura <vadimsh@chromium.org>
Commit-Queue: Marc-Antoine Ruel <maruel@chromium.org>

[modify] https://crrev.com/3ee1cf02f587cb59951328240b85d81e56fe1001/client/isolateserver.py
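The change amounts to flushing stdout right after each progress message is printed, roughly along these lines (a paraphrase of the idea, not the exact diff):

import sys


def report_progress(remaining):
    # When stdout is a pipe it is block-buffered, so without an explicit flush
    # the message can sit in the buffer long enough for the I/O timeout to fire
    # even though the fetch is making progress.
    sys.stdout.write('%d files remaining...\n' % remaining)
    sys.stdout.flush()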
,
Apr 4 2018
Workaround worked: https://chromium-swarm-dev.appspot.com/task?id=3cabe2fa78d97310

I triggered it with:
./swarming.py trigger -S chromium-swarm-dev.appspot.com -I isolateserver.appspot.com -d os Mac -d pool Chrome --priority 10 -s 02f2fcc7e0f9326b6dd24f7031c2fe49a5f743b1

Will push to prod soon.
,
Apr 5 2018
Confirmed to work: https://chromium-swarm.appspot.com/task?id=3cabfbcd56424010 only had "277 files remaining..." when I first looked at the page, confirming that flushing stdout worked.

It should be "fixed" to the extent that the tests will stop being marked as TIMED_OUT due to I/O timeout, but that doesn't fix the underlying performance issue. I'll keep the bug open a day just in case someone sees TIMED_OUT happening again.

For my experiment in comment 58, the initial bot took over 20 min to map the files but the retry took less than 2 minutes, so something is happening that makes some Macs slower than others.
,
Apr 5 2018
Ah, nice find. 20m is (obviously) way too slow to map the files, but better than failing the job and retrying.
,
Apr 5 2018
Also, is there some way we can change the swarming bot implementation so that we can monitor and alert on mapping times, and/or set hard timeouts for mapping the task inputs and outputs as opposed to running the task? Perhaps equally obviously, not being able to distinguish these different phases makes it harder to debug...
,
Apr 5 2018
https://chromium-swarm.appspot.com/tasklist?c=name&c=created_ts&c=duration&c=pool&c=bot&c=name-tag&et=1522889280000&f=state%3ATIMED_OUT&l=500&n=true&s=created_ts%3Adesc&st=1522802820000

Overheads are shown, for example:
https://chromium-swarm.appspot.com/task?id=3cac1306f95e8d10
but there's a bug where, if the task doesn't close correctly, it won't report these. I'll fix this soon.
,
Apr 5 2018
> Overheads are shown, for example
> https://chromium-swarm.appspot.com/task?id=3cac1306f95e8d10

It looks like that data isn't replicated out into swarming_tasks. Any idea if it is replicated out into ts_mon?
,
Apr 5 2018
Since the workaround was deployed, there have been two exceptions on https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/, so it is "looking better".

https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/11839 was because of lack of capacity:
https://chromium-swarm.appspot.com/task?id=3cae4fd934b2b410&refresh=10

https://ci.chromium.org/buildbot/chromium.mac/Mac10.12%20Tests/11836 was the test crashing the window subsystem:
https://chromium-swarm.appspot.com/task?id=3cad92bf41c6ff10
,
Apr 5 2018
Based on comment #59 onwards, removing Sheriff-Chromium as I think this is (hopefully) no longer relevant for the sheriff? Please re-add if I am mistaken.
,
Apr 6 2018
#53 dpranke@, what would it mean to "take this out of the CQ"? Is that to stop running layout tests for all or some platforms? Before that action is taken, can you please call a meeting including robertma@ and me to weigh the options? I'm concerned about the impact of doing that on all Blink developers, and also about the immense rebaselining work that'd fall to the ecosystem infra team to get certain configurations back online. Thanks!
,
Apr 9 2018
Pri-0 bugs are critical regressions or serious emergencies, and this bug has not been updated in three days. Could you please provide an update, or adjust the priority to a more appropriate level if applicable? If a fix is in active development, please set the status to Started. Thanks for your time! To disable nags, add the Disable-Nags label. For more details visit https://www.chromium.org/issue-tracking/autotriage - Your friendly Sheriffbot
,
Apr 9 2018
In the last 24h the timeouts were real failures. Closing. Ref:
https://chromium-swarm.appspot.com/task?id=3cbebdfc2031e310
https://chromium-swarm.appspot.com/task?id=3cbd60f2bb319510