New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 669802 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Jan 2017
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: ----
Pri: 1
Type: Bug

Blocking:
issue 680303



Sign in to add a comment

Wptserve intermittently hangs when starting.

Project Member Reported by dpranke@chromium.org, Nov 30 2016

Issue description

See bug 669194 for context. More details to follow tomorrow when I'm more awake :).
 
I'm turning off wptserve temporarily and suppressing the failures until we can figure out better what's going on.
Owner: dpranke@chromium.org
Status: Assigned (was: Available)
I will own this for now until I can at least fill in more of the information.
Project Member

Comment 3 by bugdroid1@chromium.org, Nov 30 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/2431536e807485af5161318d9d5ae6b594f7cf1b

commit 2431536e807485af5161318d9d5ae6b594f7cf1b
Author: dpranke <dpranke@chromium.org>
Date: Wed Nov 30 08:22:01 2016

Suppress wptserve-related layout test failures temporarily.

TBR=tkent@chromium.org
NOTRY=true
BUG=669194, 669802 

Review-Url: https://codereview.chromium.org/2540853003
Cr-Commit-Position: refs/heads/master@{#435163}

[modify] https://crrev.com/2431536e807485af5161318d9d5ae6b594f7cf1b/third_party/WebKit/LayoutTests/TestExpectations

Cc: jeffcarp@chromium.org
Owner: qyears...@chromium.org
Punting this to qyearsley now that he's back. qyearsley, feel free to punt to someone else.
Cc: mkwst@chromium.org
Is this the only thing blocking re-enabling wptserve? It isn't linked from https://codereview.chromium.org/2537413002 while issue 669194 is, so I'm not sure :)

Context: mkwst@ was trying to add tests today that depend on wptserve, and got timeouts instead.
Yes, this is the only thing blocking it.
So, to clarify the relationship between this and bug 669194 -- on the instances of hanging for that bug, we saw something like:

11:51:02.680 3172 Starting WPTServe ...
11:51:02.684 3172 stale wptserve pid file, pid 27540
2016/11/28 11:55:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2016/11/28 11:59:24 transport: http2Client.notifyError got notified that the client transport was broken EOF.
...


And we think that the http2Client.notifyError is not a message from the layout test runner or wptserve, but instead is a message from logdog. If this can be reproduced somewhere, then we should expect to see the "stale wptserve pid file" message.

Also, it's possible/likely that "WebKit {Linux,Mac,Win} - WPTServe" on chromium.fyi are occasionally hitting this now, and so we are likely able to find recent logs where the test runner hung when starting wptserve, right?
Correct, though it may not happen often enough that the two builders will be enough to repro it (I saw two because I didn't see it happening on windows).

Also, the logdog message should now be gone, so this failure should now translate into the build being killed off after an hour w/o activity.
Note, I searched for past very-long webkit_tests steps on "WebKit {Linux,Mac} - WPTServe" and didn't find anything more than 3000 seconds long except for this 10-day-old "WebKit Linux - WPTServe" job:

https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe/builds/7278/steps/webkit_tests/logs/stdio

Which also ends with:

11:22:08.246 30745 Starting WPTServe ...
11:22:08.246 30745 stale wptserve pid file, pid 22900
2016/11/28 11:27:00 transport: http2Client.notifyError got notified that the client transport was broken EOF.
2016/11/28 11:31:01 transport: http2Client.notifyError got notified that the client transport was broken EOF.

Note: I've now interrupted that build.


The "stale wptserve pid file" message is printed at:

https://cs.chromium.org/chromium/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/server_base.py?l=93

So one possible hypothesis is that it's hanging in ServerBase._stop_running_server, right after printing that message.
Yup, that's what I saw and hypothesized as well.

My guess was that wptserve wasn't correctly responding to the SIGKILL that _stop_running_server() sends, and perhaps we need to override the method to just send a SIGTERM instead (or do something else that provides a more reliable shutdown).
Looking again, it appears that when ServerBase._stop_running_server is called, it waits for ServerBase._check_and_kill, which first calls Executive.check_running_pid, then logs a message ("pid %d is running, killing it'"), and then uses SIGKILL on Mac or Linux and taskkill.exe on Windows.

Since that message is never logged, would that mean that it's definitely hanging in Executive.check_running_pid? Or could it still be the case that _log.debug is called but the message is never seen in the log?
Project Member

Comment 12 by bugdroid1@chromium.org, Dec 15 2016

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/4d6a4b9b7ca72f5ed0d7a25aa62d8dbb22ce1a63

commit 4d6a4b9b7ca72f5ed0d7a25aa62d8dbb22ce1a63
Author: qyearsley <qyearsley@chromium.org>
Date: Thu Dec 15 21:01:28 2016

Add logging in Executive.check_running_pid.

The purpose of this CL is to add logging in an attempt to confirm
whether check_running_pid is called before a hang occurs when
shutting down wptserve.

BUG= 669802 

Review-Url: https://codereview.chromium.org/2569643002
Cr-Commit-Position: refs/heads/master@{#438921}

[modify] https://crrev.com/4d6a4b9b7ca72f5ed0d7a25aa62d8dbb22ce1a63/third_party/WebKit/Tools/Scripts/webkitpy/common/system/executive.py

Comment 13 by tkent@chromium.org, Dec 20 2016

Log in a hang case:
https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe/builds/7513/steps/webkit_tests/logs/stdio

16:59:59.690 580 Starting WPTServe ...
16:59:59.690 580 stale wptserve pid file, pid 13335
16:59:59.690 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
16:59:59.691 580 Checking whether pid 13335 is alive.
....

It seems the previous run of webkit_tests was interrupted.
https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe/builds/7512/steps/webkit_tests/logs/stdio

Excellent, thanks for posting it!

I'm still not clear how this might be happening though. At first, it looks like it could be looping in ServerBase._wait_for_action (https://cs.chromium.org/chromium/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/server_base.py?l=177), but if that's the case, then it should have 1 second between logging the message, and it should try for 20 seconds (not 114 hours...)

Any other hypotheses about where this could be looping, besides ServerBase._wait_for_action?

Comment 15 by tkent@chromium.org, Dec 21 2016

I think this is an infinite loop in _stop_running_server() in wptserve.py.

My wild guess is:
self._executive.interrupt(self._pid) sends a signal to the target process, but the process becomes a zombie, and self._executive.check_running_pid() still returns true.


Comment 16 by tkent@chromium.org, Dec 21 2016

Can anyone login slave72-c1, and check process tree including pid=13335?

Comment 17 by tkent@chromium.org, Dec 21 2016

Bingo!  The bot still have a process tree for a run-webkit-tests kicked for an older build.

pid ppid args:
13220     1 python -u /b/rr/tmp1bW0dw/rw/checkout/scripts/slave/runtest.py --target Release --no-xvfb --factory-properties {"blamelist": ["dgozman@chromium.org", "mmenke@chromium.org", "twellington@ch
13254 13220 /usr/bin/python /b/rr/tmp1bW0dw/rw/checkout/scripts/slave/chromium/layout_test_wrapper.py --target Release -o /b/rr/tmp1bW0dw/w/layout-test-results --build-dir /b/c/b/WebKit_Linux___WPTSer
13266 13254 python /b/c/b/WebKit_Linux___WPTServe/src/third_party/WebKit/Tools/Scripts/run-webkit-tests --no-show-results --full-results-html --clobber-old-results --exit-after-n-failures 5000 --exit-
13267 13266 /usr/bin/python /b/c/b/WebKit_Linux___WPTServe/src/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/run_webkit_tests.py --no-show-results --full-results-html --clobber-old-results --
13335 13267 [python] <defunct>


There are many other zombie process in the bot.
  544   533 [Xvfb] <defunct>
  556   533 [openbox] <defunct>
  648 13372 [content_shell] <defunct>
 3276 13371 [content_shell] <defunct>
12013 13374 [content_shell] <defunct>
13231 13220 [Xvfb] <defunct>
13243 13220 [openbox] <defunct>
13335 13267 [python] <defunct>
13355 13267 [apache2] <defunct>
13703 13376 [content_shell] <defunct>
16554 13370 [content_shell] <defunct>
20301 13377 [content_shell] <defunct>
26458 13368 [content_shell] <defunct>
26535 13368 [content_shell] <defunct>
28771 13369 [content_shell] <defunct>
28961 13369 [content_shell] <defunct>

So the right solution would be to kill runtest.py?

Is it possible that that builder is not configured to auto reboot? There shouldn't be any old instances running because we should reboot in between. 
Labels: -OS-All
Summary: Wptserve intermittently hangs when starting. (was: starting wptserve seems to flakily hang)
Blocking: 678847
Quick update: looking at _stop_running_server, it looks like the infinite loop wouldn't happen if the ServerBase._stop_running_server were used instead of WPTServe._stop_running_server. There may be some context I'm missing about why an overridden version of this method was necessary.

If the overridden version were removed (http://crrev.com/2624093003), then I think that it would try killing the process with kill_process with SIGKILL instead of SIGINT on non-Windows platforms, for up to 20 seconds, and then stop trying to kill it and try to continue.

That said, Dirk has a good point in #18: if the machines are rebooting between runs then there shouldn't be any stale pid files left around...
Blocking: 680303
Project Member

Comment 23 by bugdroid1@chromium.org, Jan 12 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/66e610ba7ddabd8f30757b0378aa4185b893cdf4

commit 66e610ba7ddabd8f30757b0378aa4185b893cdf4
Author: qyearsley <qyearsley@chromium.org>
Date: Thu Jan 12 00:35:33 2017

Remove the overridden _stop_running_server method in WPTServe.

Background: When support for wptserve was originally added [1],
WPTServe had a _stop_running_server to override the version in
ServerBase, which tries to kill the existing process with
Executive.interrupt instead of  Executive.kill_process on non-Windows
platforms.

Later this method was changed in http://crrev.com/2511013002 and
now when interrupt fails to kill the process, there's an infinite
loop which results in the job stalling for days.

But, I'm not sure why it was necessary to use Executive.interrupt
rather than Executive.kill_process. If the overridden method is removed,
then ServerBase._stop_running_server would be used, which is:

    def _stop_running_server(self):
        self._wait_for_action(self._check_and_kill)
        if self._filesystem.exists(self._pid_file):
            self._filesystem.remove(self._pid_file)

[1] https://chromium.googlesource.com/chromium/src/+/0129094b0ad431a2aee9cf5ea9225037283f3c37/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve.py

Note, the actual change to behavior here is in webkitpy/layout_tests/servers/wptserve.py -- the rest of the changes are related to unit testing; I could also put those changes in a separate CL.

BUG= 669802 

Review-Url: https://codereview.chromium.org/2624093003
Cr-Commit-Position: refs/heads/master@{#443083}

[modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/common/system/executive_mock.py
[modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/common/system/system_host.py
[modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/common/system/system_host_mock.py
[modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/server_base.py
[modify] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve.py
[add] https://crrev.com/66e610ba7ddabd8f30757b0378aa4185b893cdf4/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve_unittest.py

In recent builds, we haven't seen any hanging for a while after that last change:

https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Linux%20-%20WPTServe
https://build.chromium.org/p/chromium.fyi/builders/WebKit%20Mac%20-%20WPTServe

Will keep watching and close if no more problems are observed.
When this is closed will wptserve be turned back on as the default? Or is that a separate bug I should subscribe to?

Comment 26 by mek@chromium.org, Jan 13 2017

I think  bug 618366  is for running wptserve by default?
Yep, that's right - and now I think there's nothing blocking us from re-enabling it :-D We should update WPTServeExpectations and make sure the FYI bots are happy first though.
Blocking: -678847

Comment 29 by tkent@chromium.org, Jan 16 2017

It seems both of "WebKit Linux - WPTServe" and "WebKit Linux Trusty" bots automatically reboot on every build now.  It should be safe to enable WPTServe.

Status: Fixed (was: Assigned)
Aye :-D
Project Member

Comment 31 by bugdroid1@chromium.org, Jan 20 2017

The following revision refers to this bug:
  https://chromium.googlesource.com/chromium/src.git/+/896b0f4ea756385f66de7b31c52fccf728a02cb6

commit 896b0f4ea756385f66de7b31c52fccf728a02cb6
Author: qyearsley <qyearsley@chromium.org>
Date: Fri Jan 20 01:46:13 2017

When killing wptserve, use Executive.interrupt on non-Windows platforms.

Before http://crrev.com/2624093003, the test runner would hang in a busy
loop while trying to kill wptserve.

After that CL, though, when running run-webkit-tests locally on Linux,
wptserve wouldn't be effectively killed before the next run.

This CL is meant to make it so that it behaves more like it did before
(using interrupt instead of kill for non-Windows platforms), but still
only tries to kill wptserve for 20 seconds, with 1-second sleeps between
attempts.

I tried locally to see what would happen if I undid the change in
http://crrev.com/2624093003 but only changed _stop_running_server to
use kill_process on all platforms (not just Windows), and confirmed that
this apparently doesn't kill the process properly on my workstation.

BUG= 669802 , 682711 

Review-Url: https://codereview.chromium.org/2649433002
Cr-Commit-Position: refs/heads/master@{#444933}

[modify] https://crrev.com/896b0f4ea756385f66de7b31c52fccf728a02cb6/third_party/WebKit/Tools/Scripts/webkitpy/common/system/executive_mock.py
[modify] https://crrev.com/896b0f4ea756385f66de7b31c52fccf728a02cb6/third_party/WebKit/Tools/Scripts/webkitpy/layout_tests/servers/wptserve.py

Sign in to add a comment