New issue
Advanced search Search tips

Issue 882244 link

Starred by 1 user

Issue metadata

Status: Fixed
Owner:
Closed: Oct 3
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 2
Type: Bug



Sign in to add a comment

ui.SessionManagerRespawn failed with "Job is already running: ui" due to unkillable Chrome processes

Project Member Reported by derat@chromium.org, Sep 9

Issue description

In the tricky-release run at http://stainless/browse/chromeos-autotest-results/235396368-chromeos-test/, the ui.SessionManagerRespawn test failed while trying to restart the ui job at completion:

2018/09/07 23:36:23 Started test ui.SessionManagerRespawn
2018/09/07 23:36:23 [23:36:22.567] Getting initial session_manager process
2018/09/07 23:36:23 [23:36:22.568] Initial session_manager process is 7871
2018/09/07 23:36:23 [23:36:22.568] Killing 7871
2018/09/07 23:36:23 [23:36:22.568] Waiting for session_manager to respawn
2018/09/07 23:36:24 [23:36:24.223] Respawned session_manager process is 8671
2018/09/07 23:36:24 [23:36:24.223] Repeatedly killing session_manager to check that ui-respawn stops restarting it eventually
2018/09/07 23:36:24 [23:36:24.223] Killing 8671 and watching for respawn
2018/09/07 23:36:38 [23:36:37.770] session_manager (correctly) not respawned
2018/09/07 23:36:38 [23:36:37.778] ui job not running; starting it
2018/09/07 23:36:38 [23:36:37.785] External command failed: exit status 1
2018/09/07 23:36:38 [23:36:37.785] Command: 'initctl' 'start' 'ui'
2018/09/07 23:36:38 [23:36:37.785] Uncaptured output:
initctl: Job is already running: ui
2018/09/07 23:36:38 [23:36:37.785] Error at session_manager_respawn.go:74: Failed to ensure ui job is running: restarting job "ui" failed: exit status 1

Here's the end of the 'messages' log:

2018-09-07T23:36:24.206218-07:00 INFO session_manager[8671]: [INFO:session_manager_service.cc(159)] SessionManagerService starting
2018-09-07T23:36:24.210664-07:00 INFO session_manager[8671]: [INFO:browser_job.cc(148)] Running browser /opt/google/chrome/chrome ...
2018-09-07T23:36:24.224306-07:00 WARNING kernel: [ 2460.165363] init: ui main process (8671) killed by KILL signal
2018-09-07T23:36:24.318124-07:00 NOTICE ui-respawn[8744]: ui died on signal KILL.
2018-09-07T23:36:24.361973-07:00 NOTICE ui-respawn[8798]: Respawning ui.
2018-09-07T23:36:37.864443-07:00 CRIT ui-post-stop-unkillable[8936]:  8683 Z<sl [chrome] <defunct>
2018-09-07T23:36:37.864462-07:00 CRIT ui-post-stop-unkillable[8936]:  8783 S    [chrome]
2018-09-07T23:36:37.864470-07:00 CRIT ui-post-stop-unkillable[8936]:  8822 D    [nacl_helper]
2018-09-07T23:36:37.864650-07:00 CRIT ui-post-stop-unkillable[8936]:  8825 D    [nacl_helper_non]
2018-09-07T23:36:37.864694-07:00 CRIT ui-post-stop-unkillable[8936]:  8856 Z<   [chrome] <defunct>
2018-09-07T23:36:37.864730-07:00 CRIT ui-post-stop-unkillable[8936]:  8899 Zl   [chrome] <defunct>
2018-09-07T23:36:37.864766-07:00 CRIT ui-post-stop-unkillable[8936]:  8908 Zl   [chrome] <defunct>
2018-09-07T23:36:37.864800-07:00 CRIT ui-post-stop-unkillable[8936]:  8911 Zl   [chrome] <defunct>

I see a ui-post-stop-unkillable message in issue 625159, although I think that my conclusion there that session_manager was hanging during keygen was probably incorrect (due to confusing logging in session_manager at the time).

I should probably make this test use faillog, as there's a chance that seeing a full process list would've provided more clues about what went wrong.

I should also make upstart.EnsureJobRunning log more information about the job's state when RestartJob fails. I suspect that the job was actually stuck in the "stopping" state when we tried to start it -- we should possibly wait a bit here.
 
2018/09/07 23:36:24 [23:36:24.223] Killing 8671 and watching for respawn
2018/09/07 23:36:38 [23:36:37.770] session_manager (correctly) not respawned

Taking 13 seconds before timing out is weird despite we set respawnTimeout = 5 * time.Second. I believe WaitForProc() and getPID() are both lightweight on Linux systems... 

Project Member

Comment 2 by bugdroid1@chromium.org, Sep 18

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform/tast-tests/+/c58ca8a1d11871c21a742e98c07af089b013ff45

commit c58ca8a1d11871c21a742e98c07af089b013ff45
Author: Daniel Erat <derat@chromium.org>
Date: Tue Sep 18 21:04:25 2018

tast-tests: Fix races in upstart package.

Rework the upstart package to be more robust when dealing
with jobs that aren't in the expected state. In particular,
the package is now aware of jobs' goals and polls to make
sure that they've actually reached the requested state.

BUG= chromium:882244 
TEST=added unit test; also verified that tests that use the
     upstart package still pass

Change-Id: Ie8eea427946a6c1f5e78305600918a9e8c3dd15f
Reviewed-on: https://chromium-review.googlesource.com/1223690
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Shuhei Takahashi <nya@chromium.org>

[add] https://crrev.com/c58ca8a1d11871c21a742e98c07af089b013ff45/src/chromiumos/tast/local/upstart/upstart_test.go
[modify] https://crrev.com/c58ca8a1d11871c21a742e98c07af089b013ff45/src/chromiumos/tast/local/bundles/cros/platform/check_processes.go
[modify] https://crrev.com/c58ca8a1d11871c21a742e98c07af089b013ff45/src/chromiumos/tast/local/upstart/upstart.go

Status: Started (was: Assigned)
I'm going to wait a week or two to see if there are any more Upstart-related failures.
Status: Fixed (was: Started)
I don't think I've seen this problem anymore, but there appears to still be a race that can result in failures with:

  Failed to start Chrome: stopping "ui" failed: status start/post-stop has non-"stop" goal

I'm using  issue 891594  to track it.

Sign in to add a comment