Wptserve intermittently hangs when starting. |
|||||||||
Issue descriptionSee bug 669194 for context. More details to follow tomorrow when I'm more awake :).
,
Nov 30 2016
I will own this for now until I can at least fill in more of the information.
,
Nov 30 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/2431536e807485af5161318d9d5ae6b594f7cf1b commit 2431536e807485af5161318d9d5ae6b594f7cf1b Author: dpranke <dpranke@chromium.org> Date: Wed Nov 30 08:22:01 2016 Suppress wptserve-related layout test failures temporarily. TBR=tkent@chromium.org NOTRY=true BUG=669194, 669802 Review-Url: https://codereview.chromium.org/2540853003 Cr-Commit-Position: refs/heads/master@{#435163} [modify] https://crrev.com/2431536e807485af5161318d9d5ae6b594f7cf1b/third_party/WebKit/LayoutTests/TestExpectations
,
Dec 2 2016
Punting this to qyearsley now that he's back. qyearsley, feel free to punt to someone else.
,
Dec 7 2016
Is this the only thing blocking re-enabling wptserve? It isn't linked from https://codereview.chromium.org/2537413002 while issue 669194 is, so I'm not sure :) Context: mkwst@ was trying to add tests today that depend on wptserve, and got timeouts instead.
,
Dec 7 2016
Yes, this is the only thing blocking it.
,
Dec 8 2016
So, to clarify the relationship between this and bug 669194 -- on the instances of hanging for that bug, we saw something like:
11:51:02.680 3172 Starting WPTServe ...
11:51:02.684 3172 stale wptserve pid file, pid 27540
2016/11/28 11:55:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2016/11/28 11:59:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.
...
And we think that the http2Client.notifyError is not a message from the layout test runner or wptserve, but instead is a message from logdog. If this can be reproduced somewhere, then we should expect to see the "stale wptserve pid file" message.
Also, it's possible/likely that "WebKit {Linux,Mac,Win} - WPTServe" on chromium.fyi are occasionally hitting this now, and so we are likely able to find recent logs where the test runner hung when starting wptserve, right?
,
Dec 8 2016
Correct, though it may not happen often enough that the two builders will be enough to repro it (I saw two because I didn't see it happening on windows). Also, the logdog message should now be gone, so this failure should now translate into the build being killed off after an hour w/o activity.
,
Dec 8 2016
Note, I searched for past very-long webkit_tests steps on "WebKit {Linux,Mac} - WPTServe" and didn't find anything more than 3000 seconds long except for this 10-day-old "WebKit Linux - WPTServe" job:
https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe/builds/7278/steps/webkit_tests/logs/stdio
Which also ends with:
11:22:08.246 30745 Starting WPTServe ...
11:22:08.246 30745 stale wptserve pid file, pid 22900
2016/11/28 11:27:00 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2016/11/28 11:31:01 transport: http2Client.notifyError got notified that the client transport was broken EOF.
Note: I've now interrupted that build.
The "stale wptserve pid file" message is printed at:
https://cs.chromium.org/chromium/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/server_base.py?l=93
So one possible hypothesis is that it's hanging in ServerBase._stop_running_server, right after printing that message.
,
Dec 8 2016
Yup, that's what I saw and hypothesized as well. My guess was that wptserve wasn't correctly responding to the SIGKILL that _stop_running_server() sends, and perhaps we need to override the method to just send a SIGTERM instead (or do something else that provides a more reliable shutdown).
,
Dec 12 2016
Looking again, it appears that when ServerBase._stop_running_server is called, it waits for ServerBase._check_and_kill, which first calls Executive.check_running_pid, then logs a message ("pid %d is running, killing it'"), and then uses SIGKILL on Mac or Linux and taskkill.exe on Windows.
Since that message is never logged, would that mean that it's definitely hanging in Executive.check_running_pid? Or could it still be the case that _log.debug is called but the message is never seen in the log?
,
Dec 15 2016
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/4d6a4b9b7ca72f5ed0d7a25aa62d8dbb22ce1a63 commit 4d6a4b9b7ca72f5ed0d7a25aa62d8dbb22ce1a63 Author: qyearsley <qyearsley@chromium.org> Date: Thu Dec 15 21:01:28 2016 Add logging in Executive.check_running_pid. The purpose of this CL is to add logging in an attempt to confirm whether check_running_pid is called before a hang occurs when shutting down wptserve. BUG= 669802 Review-Url: https://codereview.chromium.org/2569643002 Cr-Commit-Position: refs/heads/master@{#438921} [modify] https://crrev.com/4d6a4b9b7ca72f5ed0d7a25aa62d8dbb22ce1a63/third_party/WebKit/Tools/Scripts/webkitpy/common/system/executive.py
,
Dec 20 2016
Log in a hang case: https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe/builds/7513/steps/webkit_tests/logs/stdio 16:59:59.690 580 Starting WPTServe ... 16:59:59.690 580 stale wptserve pid file, pid 13335 16:59:59.690 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. 16:59:59.691 580 Checking whether pid 13335 is alive. .... It seems the previous run of webkit_tests was interrupted. https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe/builds/7512/steps/webkit_tests/logs/stdio
,
Dec 21 2016
Excellent, thanks for posting it! I'm still not clear how this might be happening though. At first, it looks like it could be looping in ServerBase._wait_for_action (https://cs.chromium.org/chromium/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/server_base.py?l=177), but if that's the case, then it should have 1 second between logging the message, and it should try for 20 seconds (not 114 hours...) Any other hypotheses about where this could be looping, besides ServerBase._wait_for_action?
,
Dec 21 2016
I think this is an infinite loop in _stop_running_server() in wptserve.py. My wild guess is: self._executive.interrupt(self._pid) sends a signal to the target process, but the process becomes a zombie, and self._executive.check_running_pid() still returns true.
,
Dec 21 2016
Can anyone login slave72-c1, and check process tree including pid=13335?
,
Dec 21 2016
Bingo! The bot still have a process tree for a run-webkit-tests kicked for an older build.
pid ppid args:
13220 1 python -u /b/rr/tmp1bW0dw/rw/checkout/scripts/slave/runtest.py --target Release --no-xvfb --factory-properties {"blamelist": ["dgozman@chromium.org", "mmenke@chromium.org", "twellington@ch
13254 13220 /usr/bin/python /b/rr/tmp1bW0dw/rw/checkout/scripts/slave/chromium/layout_test_wrapper.py --target Release -o /b/rr/tmp1bW0dw/w/layout-test-results --build-dir /b/c/b/WebKit_Linux___WPTSer
13266 13254 python /b/c/b/WebKit_Linux___WPTServe/src/third_party/WebKit/Tools/Scripts/run-webkit-tests --no-show-results --full-results-html --clobber-old-results --exit-after-n-failures 5000 --exit-
13267 13266 /usr/bin/python /b/c/b/WebKit_Linux___WPTServe/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py --no-show-results --full-results-html --clobber-old-results --
13335 13267 [python] <defunct>
There are many other zombie process in the bot.
544 533 [Xvfb] <defunct>
556 533 [openbox] <defunct>
648 13372 [content_shell] <defunct>
3276 13371 [content_shell] <defunct>
12013 13374 [content_shell] <defunct>
13231 13220 [Xvfb] <defunct>
13243 13220 [openbox] <defunct>
13335 13267 [python] <defunct>
13355 13267 [apache2] <defunct>
13703 13376 [content_shell] <defunct>
16554 13370 [content_shell] <defunct>
20301 13377 [content_shell] <defunct>
26458 13368 [content_shell] <defunct>
26535 13368 [content_shell] <defunct>
28771 13369 [content_shell] <defunct>
28961 13369 [content_shell] <defunct>
So the right solution would be to kill runtest.py?
,
Dec 21 2016
Is it possible that that builder is not configured to auto reboot? There shouldn't be any old instances running because we should reboot in between.
,
Dec 27 2016
,
Jan 11 2017
,
Jan 11 2017
Quick update: looking at _stop_running_server, it looks like the infinite loop wouldn't happen if the ServerBase._stop_running_server were used instead of WPTServe._stop_running_server. There may be some context I'm missing about why an overridden version of this method was necessary. If the overridden version were removed (http://crrev.com/2624093003), then I think that it would try killing the process with kill_process with SIGKILL instead of SIGINT on non-Windows platforms, for up to 20 seconds, and then stop trying to kill it and try to continue. That said, Dirk has a good point in #18: if the machines are rebooting between runs then there shouldn't be any stale pid files left around...
,
Jan 11 2017
,
Jan 12 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/66e610ba7ddabd8f30757b0378aa4185b893cdf4 commit 66e610ba7ddabd8f30757b0378aa4185b893cdf4 Author: qyearsley <qyearsley@chromium.org> Date: Thu Jan 12 00:35:33 2017 Remove the overridden _stop_running_server method in WPTServe. Background: When support for wptserve was originally added [1], WPTServe had a _stop_running_server to override the version in ServerBase, which tries to kill the existing process with Executive.interrupt instead of Executive.kill_process on non-Windows platforms. Later this method was changed in http://crrev.com/2511013002 and now when interrupt fails to kill the process, there's an infinite loop which results in the job stalling for days. But, I'm not sure why it was necessary to use Executive.interrupt rather than Executive.kill_process. If the overridden method is removed, then ServerBase._stop_running_server would be used, which is: def _stop_running_server(self): self._wait_for_action(self._check_and_kill) if self._filesystem.exists(self._pid_file): self._filesystem.remove(self._pid_file) [1] https://chromium.googlesource.com/chromium/src/+/0129094b0ad431a2aee9cf5ea9225037283f3c37/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve.py Note, the actual change to behavior here is in webkitpy/layout_tests/servers/wptserve.py -- the rest of the changes are related to unit testing; I could also put those changes in a separate CL. BUG= 669802 Review-Url: https://codereview.chromium.org/2624093003 Cr-Commit-Position: refs/heads/master@{#443083} [modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/common/system/executive_mock.py [modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/common/system/system_host.py [modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/common/system/system_host_mock.py [modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/server_base.py [modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve.py [add] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve_unittest.py
,
Jan 13 2017
In recent builds, we haven't seen any hanging for a while after that last change: https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Mac%20-%20WPTServe Will keep watching and close if no more problems are observed.
,
Jan 13 2017
When this is closed will wptserve be turned back on as the default? Or is that a separate bug I should subscribe to?
,
Jan 13 2017
I think bug 618366 is for running wptserve by default?
,
Jan 13 2017
Yep, that's right - and now I think there's nothing blocking us from re-enabling it :-D We should update WPTServeExpectations and make sure the FYI bots are happy first though.
,
Jan 13 2017
,
Jan 16 2017
It seems both of "WebKit Linux - WPTServe" and "WebKit Linux Trusty" bots automatically reboot on every build now. It should be safe to enable WPTServe.
,
Jan 16 2017
Aye :-D
,
Jan 20 2017
The following revision refers to this bug: https://chromium.googlesource.com/chromium/src.git/+/896b0f4ea756385f66de7b31c52fccf728a02cb6 commit 896b0f4ea756385f66de7b31c52fccf728a02cb6 Author: qyearsley <qyearsley@chromium.org> Date: Fri Jan 20 01:46:13 2017 When killing wptserve, use Executive.interrupt on non-Windows platforms. Before http://crrev.com/2624093003, the test runner would hang in a busy loop while trying to kill wptserve. After that CL, though, when running run-webkit-tests locally on Linux, wptserve wouldn't be effectively killed before the next run. This CL is meant to make it so that it behaves more like it did before (using interrupt instead of kill for non-Windows platforms), but still only tries to kill wptserve for 20 seconds, with 1-second sleeps between attempts. I tried locally to see what would happen if I undid the change in http://crrev.com/2624093003 but only changed _stop_running_server to use kill_process on all platforms (not just Windows), and confirmed that this apparently doesn't kill the process properly on my workstation. BUG= 669802 , 682711 Review-Url: https://codereview.chromium.org/2649433002 Cr-Commit-Position: refs/heads/master@{#444933} [modify] https://crrev.com/896b0f4ea756385f66de7b31c52fccf728a02cb6/third_party/WebKit/Tools/Scripts/webkitpy/common/system/executive_mock.py [modify] https://crrev.com/896b0f4ea756385f66de7b31c52fccf728a02cb6/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve.py |
|||||||||
►
Sign in to add a comment |
|||||||||
Comment 1 by dpranke@chromium.org
, Nov 30 2016