Tast fails to restart ui job while session_manager is restarting |
|||||||||||||
Issue descriptiontast.ui.VirtualKeyboardOmnibox FAIL: Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal
,
Oct 3
,
Oct 3
Hmm, it's failing at the very first step. It's the same code used by VirtualKeyboardTyping, so it's weird that only Omnibox fails.
,
Oct 3
I'll look into this. Maybe we're still not handling Upstart's various state transitions correctly. :-/
,
Oct 3
Test output: 2018/10/02 21:11:13 Started test ui.VirtualKeyboardOmnibox 2018/10/02 21:11:13 [21:11:12.621] Restarting ui job 2018/10/02 21:11:13 [21:11:12.804] External command failed: exit status 1 2018/10/02 21:11:13 [21:11:12.804] Command: 'initctl' 'stop' 'ui' 2018/10/02 21:11:13 [21:11:12.804] Uncaptured output: initctl: Job has already been stopped: ui 2018/10/02 21:11:13 [21:11:12.855] Saving minidump: /tmp/tast_out.414257251/ui.VirtualKeyboardOmnibox/tcsd.1110.dmp 2018/10/02 21:11:15 [21:11:14.150] Saving minidump: /tmp/tast_out.414257251/ui.VirtualKeyboardOmnibox/chapsd.1126.dmp 2018/10/02 21:11:16 [21:11:15.234] Saving minidump: /tmp/tast_out.414257251/ui.VirtualKeyboardOmnibox/cryptohomed.1219.dmp 2018/10/02 21:11:17 [21:11:16.794] Error at virtual_keyboard_omnibox.go:27: Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal Here's the relevant code: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/HEAD/src/chromiumos/tast/local/upstart/upstart.go#120 When Tast wanted to restart the ui job, it first ran "initctl stop ui", which exited with 1: "initctl: Job has already been stopped: ui". We ignored the error since it looked like the job was already stopped. Then we tried to wait for the job status to go to stop/waiting (indicating that it's already stopped), but we saw start/post-stop. This doesn't make sense to me. Why didn't "initctl stop ui" change the goal to "stop"? Here's syslog at around the time of this. It looks like session_manager was in the process of exiting as a result of the previous test, ui.SupervisedUserCrash: 2018-10-03T04:11:12.616654+00:00 ERR session_manager[15011]: [ERROR:session_manager_service.cc(332)] Choosing to end session rather than restart browser. 2018-10-03T04:11:12.616685+00:00 INFO session_manager[15011]: [INFO:session_manager_impl.cc(382)] Emitting D-Bus signal SessionStateChanged:stopping 2018-10-03T04:11:12.616853+00:00 INFO session_manager[15011]: [INFO:system_utils_impl.cc(127)] Sending 15 to 15361 as 1000 2018-10-03T04:11:12.617142+00:00 INFO session_manager[15011]: [INFO:generator_job.cc(82)] Cleaned up child 15361 2018-10-03T04:11:12.617227+00:00 INFO session_manager[15011]: [INFO:session_manager_impl.cc(389)] Emitting D-Bus signal SessionStateChanged:stopped 2018-10-03T04:11:12.617367+00:00 INFO session_manager[15011]: [INFO:session_manager_service.cc(497)] SessionManagerService quitting run loop 2018-10-03T04:11:12.617508+00:00 INFO session_manager[15011]: [INFO:session_manager_service.cc(214)] SessionManagerService exiting 2018-10-03T04:11:12.621310+00:00 WARNING session_manager[15011]: [WARNING:session_manager_main.cc(262)] session_manager exiting with code 1 2018-10-03T04:11:12.625072+00:00 WARNING kernel: [ 834.860139] init: ui main process (15011) terminated with status 1 2018-10-03T04:11:12.660362+00:00 NOTICE ui-respawn[15506]: ui failed with exit status 1. 2018-10-03T04:11:12.682259+00:00 NOTICE ui-respawn[15517]: Respawning ui. I'm guessing there's a race when we try to stop and start the ui job while it's also being restarted in response to session_manager exiting. I'm wondering if this could be related to the Upstart bug tracked by https://bugs.launchpad.net/upstart/+bug/703800. Maybe Scott still remembers the relevant Upstart code here. TL;DR: Is there a reliable way to stop and then start an Upstart job regardless of what it's currently doing? The thing I'm trying is: a) run "initctl stop <job>"; disregard result b) wait for job status to change to stop/waiting c) run "initctl start <job>" I'm seeing a "start" goal after a).
,
Oct 3
,
Oct 3
Issue 882244 tracked my previous efforts to avoid races.
,
Oct 4
On the bright side, I'm able to reproduce this locally with "tast run" by passing "ui.SupervisedUserCrash ui.VirtualKeyboardOmnibox" as the test args: ... 2018/10/04 11:15:18 [11:15:18.748] Logging in as user "testuser@gmail.com" 2018/10/04 11:15:18 [11:15:18.805] Waiting for cryptohome for user "testuser@gmail.com" 2018/10/04 11:15:19 [11:15:19.429] Waiting for OOBE to be dismissed 2018/10/04 11:15:20 [11:15:20.335] Waiting for org.chromium.SessionManager D-Bus service 2018/10/04 11:15:20 Completed test ui.SupervisedUserCrash in 6.316s with 0 error(s) 2018/10/04 11:15:20 Started test ui.VirtualKeyboardOmnibox 2018/10/04 11:15:20 [11:15:20.504] Restarting ui job 2018/10/04 11:15:20 [11:15:20.663] External command failed: exit status 1 2018/10/04 11:15:20 [11:15:20.663] Command: 'initctl' 'stop' 'ui' 2018/10/04 11:15:20 [11:15:20.663] Uncaptured output: initctl: Job has already been stopped: ui ... 2018/10/04 11:15:23 [11:15:23.476] Error at virtual_keyboard_omnibox.go:27: Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal However, when I also run: while true; do d=$(date +%H:%M:%S.%N); printf "%s " $d; status ui; sleep 0.1; done on the DUT while doing this, the status transitions are still nonsense: ... 11:15:19.966940720 ui start/running, process 19935 11:15:20.081245282 ui start/running, process 19935 11:15:20.201248923 ui start/running, process 19935 11:15:20.320523454 ui start/running, process 19935 11:15:20.437616639 ui start/running, process 19935 <--- here's where tast runs "initctl stop ui" 11:15:20.549812758 ui stop/stopping <--- ??? 11:15:20.672816233 ui start/post-stop, process 20570 11:15:20.790860350 ui start/post-stop, process 20570 11:15:20.907199346 ui start/post-stop, process 20570 11:15:21.023153251 ui start/post-stop, process 20570 ... 11:15:23.579302491 ui start/post-stop, process 20570 11:15:23.688788526 ui start/post-stop, process 20570 11:15:23.799139587 ui start/starting 11:15:23.906290402 ui start/starting 11:15:24.012038052 ui start/starting 11:15:24.118064047 ui start/starting 11:15:24.224644939 ui start/starting 11:15:24.332157651 ui start/starting 11:15:24.444373788 ui start/starting 11:15:24.556112588 ui start/running, process 20884 11:15:24.663143606 ui start/running, process 20884 ... Here's the corresponding section of /var/log/messages: 2018-10-04T11:15:20.498610-07:00 INFO session_manager[19935]: [INFO:session_manager_service.cc(310)] Browser process 19967 exited with signal 9 (Killed) ... 2018-10-04T11:15:20.499875-07:00 INFO session_manager[19935]: [INFO:session_manager_service.cc(497)] SessionManagerService quitting run loop 2018-10-04T11:15:20.499969-07:00 INFO session_manager[19935]: [INFO:session_manager_service.cc(214)] SessionManagerService exiting 2018-10-04T11:15:20.504465-07:00 WARNING session_manager[19935]: [WARNING:session_manager_main.cc(262)] session_manager exiting with code 1 2018-10-04T11:15:20.509859-07:00 WARNING kernel: [491824.425709] init: ui main process (19935) terminated with status 1 2018-10-04T11:15:20.513809-07:00 NOTICE kernel: [491824.429972] audit: type=1400 audit(1538676920.512:2463): avc: granted { execute } for pid=20547 comm="init" name="ui-respawn" dev="mmcblk0p5" ino=51061 scontext=u:r:cros_init:s0 tcontext=u:object_r:cros_init_shell_scripts:s0 tclass=file 2018-10-04T11:15:20.513837-07:00 NOTICE kernel: [491824.430172] audit: type=1400 audit(1538676920.512:2464): avc: granted { execute } for pid=20547 comm="init" name="dash" dev="mmcblk0p5" ino=96088 scontext=u:r:cros_init:s0 tcontext=u:object_r:sh_exec:s0 tclass=file 2018-10-04T11:15:20.514828-07:00 NOTICE kernel: [491824.430595] audit: type=1400 audit(1538676920.513:2465): avc: granted { execute } for pid=20547 comm="ui-respawn" path="/bin/dash" dev="mmcblk0p5" ino=96088 scontext=u:r:cros_init:s0 tcontext=u:object_r:sh_exec:s0 tclass=file 2018-10-04T11:15:20.529184-07:00 NOTICE ui-respawn[20554]: ui failed with exit status 1. 2018-10-04T11:15:20.565194-07:00 NOTICE ui-respawn[20566]: Respawning ui. ... 2018-10-04T11:15:24.542517-07:00 INFO session_manager[20884]: [INFO:session_manager_service.cc(152)] SessionManagerService starting So even though the job briefly goes to stop/stopping as a result of "initctl stop ui", it then goes through start/post-stop, start/starting, and start/running afterward. I'm pretty sure that this is the doing of the ui-respawn job, which is "start on stopping ui" and runs "start -n ui" to manually restart the ui job. I presumably need to either find some way to make ui-respawn not do that in this case, or else make the Tast upstart package tolerant of it happening...
,
Oct 4
Uploaded https://crrev.com/c/1262179 as a workaround.
,
Oct 5
Is this the reason for the ongoing tast.ui.VirtualKeyboardOmnibox failures on eve?
,
Oct 5
This is tracking test failures with the error "Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal".
,
Oct 6
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/e405c3ed75bc256782d0db8e2ca03af1a6a75872 commit e405c3ed75bc256782d0db8e2ca03af1a6a75872 Author: Daniel Erat <derat@chromium.org> Date: Sat Oct 06 02:37:55 2018 tast-tests: Add ui restart hack to upstart.StopJob. The ui Upstart job is atypical, in that it's respawned by a separate ui-respawn job that is started on "stopping ui". As a result, if session_manager exits just before upstart.StopJob(ctx, "ui") is called, the job may get restarted even after the test runs "initctl stop ui". This change makes upstart.StopJob wait for the ui job to either have a "start" goal or reach "stop/waiting" while the ui-respawn job is in "stop-waiting". BUG= chromium:891594 TEST=ran ui.SupervisedUserCrash followed by ui.VirtualKeyboardOmnibox and saw the second test start chrome instead of getting an error like "Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal" Change-Id: I2d542e3939cee6369fe477b2776cf5c39e51cd60 Reviewed-on: https://chromium-review.googlesource.com/1262179 Commit-Ready: ChromeOS CL Exonerator Bot <chromiumos-cl-exonerator@appspot.gserviceaccount.com> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> Reviewed-by: Hidehiko Abe <hidehiko@chromium.org> [modify] https://crrev.com/e405c3ed75bc256782d0db8e2ca03af1a6a75872/src/chromiumos/tast/local/upstart/upstart.go
,
Oct 8
This failure looks like it isn't happening now.
,
Oct 9
Sigh, I think it's still happening occasionally. See ui.VirtualKeyboardOmnibox in http://stainless/browse/chromeos-autotest-results/246823430-chromeos-test/: 2018/10/09 08:09:20 Started test ui.SupervisedUserCrash 2018/10/09 08:09:20 [08:09:18.903] Restarting ui job 2018/10/09 08:09:22 [08:09:21.050] Waiting for org.chromium.SessionManager D-Bus service 2018/10/09 08:09:22 [08:09:21.077] Asking session_manager to enable Chrome testing 2018/10/09 08:09:22 [08:09:21.080] Waiting for Chrome to write its debugging port to /home/chronos/DevToolsActivePort 2018/10/09 08:09:23 [08:09:21.966] Checking cryptohomed service 2018/10/09 08:09:23 [08:09:21.971] Removing cryptohome for testuser@gmail.com 2018/10/09 08:09:23 [08:09:22.064] Finding OOBE DevTools target 2018/10/09 08:09:23 [08:09:22.169] Connecting to Chrome at ws://127.0.0.1:45466/devtools/page/9490D8403A4A2ABA023EFD5C5A24E577 2018/10/09 08:09:23 [08:09:22.474] Waiting for OOBE 2018/10/09 08:09:27 [08:09:26.057] Logging in as user "testuser@gmail.com" 2018/10/09 08:09:27 [08:09:26.329] Waiting for cryptohome for user "testuser@gmail.com" 2018/10/09 08:09:28 [08:09:27.506] Waiting for OOBE to be dismissed 2018/10/09 08:09:30 [08:09:29.367] Waiting for org.chromium.SessionManager D-Bus service 2018/10/09 08:09:30 Completed test ui.SupervisedUserCrash in 10.598s with 0 error(s) 2018/10/09 08:09:30 Started test ui.VirtualKeyboardOmnibox 2018/10/09 08:09:30 [08:09:29.501] Restarting ui job 2018/10/09 08:09:30 [08:09:29.659] External command failed: exit status 1 2018/10/09 08:09:30 [08:09:29.659] Command: 'initctl' 'stop' 'ui' 2018/10/09 08:09:30 [08:09:29.660] Uncaptured output: initctl: Job has already been stopped: ui ... 2018/10/09 08:09:38 [08:09:36.983] Error at virtual_keyboard_omnibox.go:27: Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal --- 2018-10-09T15:09:29.508978+00:00 INFO session_manager[9644]: [INFO:session_manager_service.cc(214)] SessionManagerService exiting 2018-10-09T15:09:29.517392+00:00 WARNING session_manager[9644]: [WARNING:session_manager_main.cc(262)] session_manager exiting with code 1 2018-10-09T15:09:29.519597+00:00 WARNING kernel: [ 373.565192] init: ui main process (9644) terminated with status 1 2018-10-09T15:09:29.536097+00:00 INFO crash_reporter[10101]: libminijail[3]: mount '/dev/log' -> '/dev/log' type '' flags 0x1001 2018-10-09T15:09:29.545526+00:00 NOTICE ui-respawn[10110]: ui failed with exit status 1. 2018-10-09T15:09:29.566239+00:00 NOTICE ui-respawn[10124]: Respawning ui. ... 2018-10-09T15:09:37.788104+00:00 WARNING session_manager[10304]: [WARNING:chrome_setup.cc(92)] Could not find both paths: /usr/share/chromeos-assets/wallpaper/oem_large.jpg and /usr/share/chromeos-assets/wallpaper/oem_small.jpg 2018-10-09T15:09:37.794810+00:00 INFO session_manager[10304]: [INFO:session_manager_service.cc(152)] SessionManagerService starting
,
Oct 9
,
Oct 9
Here's my attempt to piece together what happened:
15:09:29.501 RestartJob called
waitUIJobStabilized returns, probably with
ui start/running and ui-respawn stop/waiting
15:09:29.520 session_manager process exits with status 1
ui job stops
StopJob's "initctl stop ui" call fails
15:09:29.545 ui-respawn is running
15:09:29.566 ui-respawn starts ui job
15:09:29.659 StopJob's WaitForJobStatus call fails due to
ui job having "start" goal
,
Oct 9
Uploaded https://crrev.com/c/1271940 as another speculative fix.
,
Oct 9
,
Oct 11
The following revision refers to this bug: https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/95e8fca18a87742b464ececff7c8e50a397a806a commit 95e8fca18a87742b464ececff7c8e50a397a806a Author: Daniel Erat <derat@chromium.org> Date: Thu Oct 11 00:08:04 2018 tast-tests: Speculative fix for ui job flakiness. Make upstart.StopJob's special code for handling the "ui" job run "initctl stop ui" before waiting for the job's status to stabilize. Otherwise, there seems to be a race where waitUIJobStabilized sometimes sees the initial "start/running" state even though session_manager is already in the process of exiting. BUG= chromium:891594 TEST=no failures when running ui.SupervisedUserCrash ui.VirtualKeyboardOmnibox, but i'm already unable to reproduce the sporadic failures that are currently seen on release builders, so this is just speculative Change-Id: I1219fda45c19a9bd0ed03f1a24ffb9e879d03578 Reviewed-on: https://chromium-review.googlesource.com/1271940 Commit-Ready: Dan Erat <derat@chromium.org> Tested-by: Dan Erat <derat@chromium.org> Reviewed-by: Shuhei Takahashi <nya@chromium.org> [modify] https://crrev.com/95e8fca18a87742b464ececff7c8e50a397a806a/src/chromiumos/tast/local/upstart/upstart.go
,
Oct 12
ui.VirtualKeyboardOmnibox is still failing flakily, but it's due to other issues (in the test itself?) now -- I'm not seeing the ui job errors caused by ui.SupervisedUserCrash running beforehand now.
,
Oct 12
Issue 860519 has been merged into this issue. |
|||||||||||||
►
Sign in to add a comment |
|||||||||||||
Comment 1 by sammiequon@chromium.org
, Oct 3