New issue
Advanced search Search tips
Note: Color blocks (like or ) mean that a user may not be available. Tooltip shows the reason.

Issue 876196 link

Starred by 2 users

Issue metadata

Status: Fixed
Owner:
Closed: Sep 17
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: ----

Blocking:
issue 883875



Sign in to add a comment

wizpig-paladin , winky-paladin failed: login_LoginSuccess: Timed out waiting for condition: Session stopped.

Project Member Reported by sheriff-...@appspot.gserviceaccount.com, Aug 21

Issue description

Filed by sheriff-o-matic@appspot.gserviceaccount.com on behalf of tfiga@chromium.org

wizpig-paladin:3846 failed

login_LoginSuccess: retry_count: 2, ERROR: Timed out waiting for condition: Session stopped.

Builders failed on: 
- wizpig-paladin: 
  https://cros-goldeneye.corp.google.com/chromeos/healthmonitoring/buildDetails?buildbucketId=8937626561988762208

Same failure on 3844 and 3845:
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/3844
https://luci-milo.appspot.com/buildbot/chromeos/wizpig-paladin/3845

08/20 12:00:16.622 INFO |    cros_interface:0575| (Re)starting the ui (logs the user out)
08/20 12:00:16.624 DEBUG|      global_hooks:0056| ['sh', '-c', 'systemctl']
08/20 12:00:16.639 DEBUG|      global_hooks:0056| ['sh', '-c', 'systemctl']
08/20 12:00:16.653 DEBUG|      global_hooks:0056| ['sh', '-c', 'status ui']
08/20 12:00:16.672 DEBUG|    cros_interface:0454| IsServiceRunning(ui)->True
08/20 12:00:16.673 DEBUG|    cros_interface:0058| sh -c restart ui 
08/20 12:00:16.673 DEBUG|      global_hooks:0056| ['sh', '-c', 'restart ui']
08/20 12:00:39.979 DEBUG|    cros_interface:0067|  > stdout=[ui start/running, process 22584
], stderr=[]
08/20 12:00:39.980 DEBUG|    cros_interface:0058| sh -c cryptohome-path user 'test@test.test' 
08/20 12:00:39.980 DEBUG|      global_hooks:0056| ['sh', '-c', "cryptohome-path user 'test@test.test'"]
08/20 12:00:40.016 DEBUG|    cros_interface:0067|  > stdout=[/home/user/580bcfa450e2f41f4333b0e8067f8fd279423bfa
], stderr=[]
08/20 12:00:40.018 DEBUG|    cros_interface:0058| sh -c /bin/df --output=source,target /run/cryptohome/ephemeral_mount/580bcfa450e2f41f4333b0e8067f8fd279423bfa 
08/20 12:00:40.018 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/df --output=source,target /run/cryptohome/ephemeral_mount/580bcfa450e2f41f4333b0e8067f8fd279423bfa']
08/20 12:00:40.039 DEBUG|    cros_interface:0067|  > stdout=[], stderr=[df: /run/cryptohome/ephemeral_mount/580bcfa450e2f41f4333b0e8067f8fd279423bfa: No such file or directory
]
08/20 12:00:40.040 DEBUG|    cros_interface:0058| sh -c cryptohome-path user 'test@test.test' 
08/20 12:00:40.040 DEBUG|      global_hooks:0056| ['sh', '-c', "cryptohome-path user 'test@test.test'"]
08/20 12:00:40.058 DEBUG|    cros_interface:0067|  > stdout=[/home/user/580bcfa450e2f41f4333b0e8067f8fd279423bfa
], stderr=[]
08/20 12:00:40.059 DEBUG|    cros_interface:0058| sh -c /bin/df --output=source,target /home/user/580bcfa450e2f41f4333b0e8067f8fd279423bfa 
08/20 12:00:40.059 DEBUG|      global_hooks:0056| ['sh', '-c', '/bin/df --output=source,target /home/user/580bcfa450e2f41f4333b0e8067f8fd279423bfa']
08/20 12:00:40.080 DEBUG|    cros_interface:0067|  > stdout=[Filesystem     Mounted on
/dev/mmcblk0p1 /home
], stderr=[]
08/20 12:00:40.081 INFO |           browser:0221| Browser is closed.
08/20 12:01:40.023 ERROR|             utils:2767| Timed out waiting for condition: Session stopped.
08/20 12:01:40.034 DEBUG|              test:0381| Test failed due to Timed out waiting for condition: Session stopped.. Exception log follows the after_iteration_hooks.
 
Cc: gwendal@chromium.org pprabhu@chromium.org ihf@chromium.org sbasi@chromium.org
Labels: OS-Chrome
Owner: pprabhu@chromium.org
Summary: wizpig-paladin , winky-paladin failed: login_LoginSuccess: Timed out waiting for condition: Session stopped. (was: wizpig-paladin:3846 failed: login_LoginSuccess: Timed out waiting for condition: Session stopped.)
Same failure also happened on winky-paladin:
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/5891
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/5892
https://luci-milo.appspot.com/buildbot/chromeos/winky-paladin/5893
Owner: tfiga@chromium.org
Status: Untriaged (was: Available)
I don't see how I am the right owner for this bug.
This should be handled initially by the chrome gardner or the sheriff to figure out which component of ChromeOS is at fault.

The initial report should at least contain the stack trace reported in the test logs:

Traceback (most recent call last):
  File "/usr/local/autotest/common_lib/test.py", line 600, in _exec
    _call_test_function(self.execute, *p_args, **p_dargs)
  File "/usr/local/autotest/common_lib/test.py", line 800, in _call_test_function
    return func(*args, **dargs)
  File "/usr/local/autotest/common_lib/test.py", line 464, in execute
    postprocess_profiled_run, args, dargs)
  File "/usr/local/autotest/common_lib/test.py", line 371, in _call_run_once
    self.run_once(*args, **dargs)
  File "/usr/local/autotest/tests/login_LoginSuccess/login_LoginSuccess.py", line 58, in run_once
    timeout=self._SESSION_STOP_TIMEOUT)
  File "/usr/local/autotest/common_lib/cros/session_manager.py", line 55, in wait_for_signals
    timeout=timeout)
  File "/usr/local/autotest/common_lib/utils.py", line 2768, in poll_for_condition
    raise TimeoutError(message=desc)
TimeoutError: Timed out waiting for condition: Session stopped.

Next step is to figure out from chrome / ui/ arc++ logs why the session was stopped on the DUT.
This is a ChromeOS bug.
Cc: akhouderchah@chromium.org bleung@chromium.org
Owner: achuith@chromium.org
Status: Assigned (was: Untriaged)
+other sheriffs

Prathmesh, I'm the sheriff this week and you're the person who modified this test last time. If you know a better way of finding the right people to investigate test failures, please let me know. (I missed achuith@ among the other, inactive authors of this test, though. Some tests don't even seem to have any active authors anymore.)

Also, the trace is just a generic timeout trace from autotest and is not very useful, so I went with direct log from the test on the DUT.

achuith@, among the 3 authors listed in control files, you're the only one still in Chrome OS. Do you think you could take a look at this?
Cc: achuith@chromium.org
Owner: ----
Status: Untriaged (was: Assigned)
I don't think it's a problem with the test. Looks like the session manager died? 

As noted above, the sheriff should handle this since it seems like a regression.

Did you look at the logs to see what's happening? Is this a reliable failure that can be bisected?
Cc: derat@chromium.org apronin@chromium.org
Fixing regressions is not a part of sheriff's responsibilities... I can quickly look through the list of CLs, but I have completely no idea about session manager or this particular test, so if there isn't anything obvious, I can't do too much. The main responsibility of the sheriff is to catch failures, file bugs with any potentially useful details picked from the builders and delegate to appropriate owners.

+derat@ and apronin@, since they seem to have made some progress on investigating another session manager  issue 876197 .
https://storage.cloud.google.com/chromeos-autotest-results/229433642-chromeos-test/chromeos6-row2-rack19-host1/login_LoginSuccess/sysinfo/var/log_diff/messages

I see messages like:
2018-08-20T18:15:14.469112+00:00 WARNING crash_reporter[7824]: [user] Received crash notification for chrome[7278] sig 6, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)
2018-08-20T18:15:14.472614+00:00 WARNING crash_reporter[7825]: Could not load the device policy file.
2018-08-20T18:15:14.472854+00:00 WARNING crash_reporter[7825]: [user] Received crash notification for chrome[7232] sig 6, user 1000 group 1000 (ignoring call by kernel - chrome crash; waiting for chrome to call us directly)

I don't know if this is a problem with chrome crashing?
Components: OS>Systems
My usual approach is to delete flaky tests that nobody wants to own, or at least remove them from bvt-inline.

I'm not familiar with this test (achuith@ is the lucky person left in its AUTHOR attribute... :-P), but it looks like it logs in, starts listening for SessionStateChanged D-Bus signals, and then logs out and hopes to see a "stopped" signal.

Here's the failure from login_LoginSuccess.INFO in http://stainless/browse/chromeos-autotest-results/229425820-chromeos-test/:

08/20 11:10:58.623 INFO |           browser:0208| Closing browser (pid=6498) ...
08/20 11:10:58.626 INFO |    cros_interface:0575| (Re)starting the ui (logs the user out)
08/20 11:11:20.630 INFO |           browser:0221| Browser is closed.
08/20 11:12:20.581 ERROR|             utils:2767| Timed out waiting for condition: Session stopped.

And here's what the messages log says:

...
2018-08-20T18:10:58.750307+00:00 INFO session_manager[6419]: [INFO:session_manager_impl.cc(388)] Emitting D-Bus signal SessionStateChanged:stopping
2018-08-20T18:10:58.752037+00:00 INFO session_manager[6419]: [INFO:browser_job.cc(167)] Terminating browser process 6498 with signal 15: exiting cleanly
2018-08-20T18:10:58.752616+00:00 INFO session_manager[6419]: [INFO:system_utils_impl.cc(94)] Sending 15 to 6498 as 1000
2018-08-20T18:10:58.752958+00:00 INFO session_manager[6419]: [INFO:system_utils_impl.cc(94)] Sending 15 to 6685 as 1000
2018-08-20T18:10:58.909715+00:00 INFO chrome[6498]: [6498:6498:0820/111058.909620:INFO:remote_commands_invalidator.cc(45)] Shutdown RemoteCommandsInvalidator.
2018-08-20T18:10:58.915966+00:00 INFO chrome[6498]: [6498:6498:0820/111058.915896:INFO:remote_commands_invalidator.cc(68)] Stopping RemoteCommandsInvalidator.
2018-08-20T18:10:58.916179+00:00 INFO chrome[6498]: [6498:6498:0820/111058.916142:INFO:remote_commands_invalidator.cc(167)] Unregister RemoteCommandsInvalidator.
2018-08-20T18:11:18.771862+00:00 WARNING kernel: [ 1386.003086] init: ui main process (6419) killed by KILL signal

So it looks like the "stopping" state is sent, but then session_manager receives SIGKILL before it emits "stopped". And SIGKILL comes 20 seconds after Chrome gets SIGTERM, which seems like way too long.

void SessionManagerService::CleanupChildren(base::TimeDelta timeout,
                                            ExitCode code) {
  std::string reason = ExitCodeToString(code);
  browser_->Kill(SIGTERM, reason);
  key_gen_.RequestJobExit(reason);
  android_container_->RequestJobExit(
      code == ExitCode::SUCCESS
          ? ArcContainerStopReason::SESSION_MANAGER_SHUTDOWN
          : ArcContainerStopReason::BROWSER_SHUTDOWN);
  browser_->WaitAndAbort(timeout);
  key_gen_.EnsureJobExit(timeout);
}

The "Terminating browser process 6498 with signal 15" comes from browser_->Kill, so we at least made it that far.
So, same thing as  https://crbug.com/876197#c4 ? Occasionally chrome/session_manager takes too long to properly shutdown, and is force-killed instead. 

Sounds like an actual issue (not sure how important, though), which that test helped to uncover. The waiting-for-stopped logic also looked sound after a quick glance (it first starts to wait and then asks to shutdown the session, so no race). Maybe don't delete the test yet? :)
Cc: xutan@chromium.org yusukes@chromium.org cmtm@chromium.org lhchavez@chromium.org
Here's the code that calls CleanupChildren:

void SessionManagerService::SetExitAndScheduleShutdown(ExitCode code) {
  shutting_down_ = true;
  exit_code_ = code;
  impl_->AnnounceSessionStoppingIfNeeded();

  child_exit_dispatcher_.reset();
  liveness_checker_->Stop();
  CleanupChildren(GetKillTimeout(), code);
  impl_->AnnounceSessionStopped();

  brillo::MessageLoop::current()->PostTask(
      FROM_HERE, base::Bind(&brillo::MessageLoop::BreakLoop,
                            base::Unretained(brillo::MessageLoop::current())));
  LOG(INFO) << "SessionManagerService quitting run loop";
}

AnnounceSessionStopped logs a "Emitting D-Bus signal SessionStateChanged:stopped" message that we didn't see here (and it not running is the reason for the test failure), so it looks like session_manager got stuck in CleanupChildren. Looking through CleanupChildren:

- BrowserJob::Kill doesn't block.
- KeyGenerator::RequestJobExit doesn't block.
- AndroidOciWrapper::RequestJobExit runs "/usr/bin/run_oci --signal=KILL kill <id>" synchronously. From the logs, I don't think that ARC was even running here, so AndroidOciWrapper::RequestJobExit seems unlikely to be the culprit.
- BrowserJob::WaitAndAbort at least attempts to block for no more than 3 seconds. If it returned, it would've logged either "Cleaned up browser process" or "Aborting browser process", so I think it's safe to say we didn't get past here.

Here's WaitAndAbort:

void BrowserJob::WaitAndAbort(base::TimeDelta timeout) {
  const pid_t pid = subprocess_->GetPid();
  if (pid < 0)
    return;

  if (!system_->ProcessGroupIsGone(pid, timeout)) {
    LOG(WARNING) << "Aborting browser process " << pid << "'s process group "
                 << timeout.InSeconds() << " seconds after sending signal";
    std::string message = base::StringPrintf("Browser took more than %" PRId64
                                             " seconds to exit after signal.",
                                             timeout.InSeconds());
    KillEverything(SIGABRT, message);
  } else {
    DLOG(INFO) << "Cleaned up browser process " << pid;
  }
}

SystemUtilsImpl::ProcessGroupIsGone seems like the likely culprit since it's the only blocking operation before the log messages that we didn't see. This method is tricky and has had at least one race in the past ( issue 754076 ).

This comment is long enough, so I'll spend some time staring at the code.
Cc: mnissler@chromium.org
I uploaded https://crrev.com/c/1186107 to make session_manager not call waitpid() so quickly and to improve its logging. I couldn't find any bugs in ProcessGroupIsGone that would result in it hanging, though. Additional eyes on the code might help.

And note that in addition to session_manager apparently hanging instead of aborting Chrome, it may be the case that Chrome didn't exit when it received SIGTERM.
Looking at http://stainless/browse/chromeos-image-archive/betty-paladin/R70-10993.0.0-rc3/vm_test_results_1/smoke/test_harness/all/SimpleTestVerify/1_autotest_tests/results-33-login_LoginSuccess/ (the betty-paladin run from #12), the timestamps in the messages log that Autotest collected don't even line up with the test failure:

08/22 23:35:18.537 INFO |    cros_interface:0575| (Re)starting the ui (logs the user out)
08/22 23:35:41.166 INFO |           browser:0221| Browser is closed.
08/22 23:36:41.166 ERROR|             utils:2767| Timed out waiting for condition: Session stopped.

Whereas messages goes from 2018-08-23T03:52:05.756243+00:00 to 2018-08-23T04:03:32.051352+00:00. Hard to say for sure that it's the same cause.
Project Member

Comment 15 by bugdroid1@chromium.org, Aug 29

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/4ce3b598d951eb4082eade187e9704e402afc4da

commit 4ce3b598d951eb4082eade187e9704e402afc4da
Author: Daniel Erat <derat@chromium.org>
Date: Wed Aug 29 18:15:25 2018

login: Sleep for 10 ms between waitpid() calls.

Make session_manager's SystemUtilsImpl::Wait method sleep
for 10 milliseconds between waitpid calls instead of calling
base::PlatformThread::YieldCurrentThread. Using
YieldCurrentThread can result in hundreds of thousands of
waitpid calls while Chrome is exiting, which seems
unnecessary at best.

Also make logging clearer to help in debugging
hang-during-signout issues: log a new "Waiting up to 3
seconds for ___'s process group to exit" message just before
waiting for Chrome to exit, and remove a "Will wait 3s for
graceful browser exit." message that was confusingly logged
at startup.

BUG= chromium:876196 
TEST=manual: normal chrome-initiated signout still works;
     also verified that session_manager still sends SIGABRT
     chrome when i send SIGSTOP to chrome and then make a
     StopSession d-bus method call to session_manager

Change-Id: I5d8a8700032c6fe8deae613615f7d4a54b3a458f
Reviewed-on: https://chromium-review.googlesource.com/1186107
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Yusuke Sato <yusukes@chromium.org>

[modify] https://crrev.com/4ce3b598d951eb4082eade187e9704e402afc4da/login_manager/session_manager_main.cc
[modify] https://crrev.com/4ce3b598d951eb4082eade187e9704e402afc4da/login_manager/system_utils_impl.cc
[modify] https://crrev.com/4ce3b598d951eb4082eade187e9704e402afc4da/login_manager/system_utils.h
[modify] https://crrev.com/4ce3b598d951eb4082eade187e9704e402afc4da/login_manager/browser_job.cc

Status: WontFix (was: Untriaged)
I don't see any of these failures on -release builders after R70-10993.0.0-rc4 on 2018-08-23: https://stainless.corp.google.com/search?view=matrix&row=model&col=build&first_date=2018-08-21&last_date=2018-09-03&builder_name=-paladin%24&test=^login_LoginSuccess%24&status=GOOD&status=WARN&status=FAIL&status=ERROR&exclude_cts=false&exclude_not_run=false&exclude_non_release=false&exclude_au=true&exclude_acts=true&exclude_retried=true&exclude_non_production=true

#15 didn't go in until 08-29, so I'm not sure that it could've been what helped.

These failures look like they were happening frequently before, though, and there haven't been any in the last 10 days, so I'm closing this out.
Status: Untriaged (was: WontFix)
We're having a rash of these again in the CQ right now.
Can you link to some of the failures here?
Components: Infra>Client>ChromeOS>CI Infra>Client>ChromeOS>Test
Labels: -Pri-2 Pri-1
auron_yuna and cyan show the same *pattern* of failing builds, but a different failure:

cheets_Midis: retry_count: 2, FAIL: midis didn't restart.
Thanks. The logs for wolf appear to be missing, but here's what I see in the winky logs at http://stainless/browse/chromeos-autotest-results/237499877-chromeos-test/ :

09/13 09:11:22.821 INFO |           browser:0208| Closing browser (pid=24902) ...
09/13 09:11:22.839 INFO |    cros_interface:0575| (Re)starting the ui (logs the user out)
09/13 09:11:44.791 INFO |           browser:0221| Browser is closed.
09/13 09:12:44.745 ERROR|             utils:2769| Timed out waiting for condition: Session stopped.

2018-09-13T16:11:22.993407+00:00 INFO session_manager[24839]: [INFO:session_manager_impl.cc(383)] Emitting D-Bus signal SessionStateChanged:stopping
2018-09-13T16:11:22.994715+00:00 INFO session_manager[24839]: [INFO:browser_job.cc(167)] Terminating browser process 24902 with signal 15: exiting cleanly
2018-09-13T16:11:22.994904+00:00 INFO session_manager[24839]: [INFO:system_utils_impl.cc(94)] Sending 15 to 24902 as 1000
2018-09-13T16:11:22.998046+00:00 INFO session_manager[24839]: [INFO:system_utils_impl.cc(94)] Sending 15 to 25105 as 1000
2018-09-13T16:11:22.998254+00:00 INFO session_manager[24839]: [INFO:browser_job.cc(177)] Waiting up to 60 seconds for 24902's process group to exit
2018-09-13T16:11:23.207940+00:00 INFO chrome[24902]: [24902:24902:0913/091123.207861:INFO:remote_commands_invalidator.cc(45)] Shutdown RemoteCommandsInvalidator.
2018-09-13T16:11:23.208054+00:00 INFO chrome[24902]: [24902:24902:0913/091123.208018:INFO:remote_commands_invalidator.cc(68)] Stopping RemoteCommandsInvalidator.
2018-09-13T16:11:23.208120+00:00 INFO chrome[24902]: [24902:24902:0913/091123.208085:INFO:remote_commands_invalidator.cc(167)] Unregister RemoteCommandsInvalidator.
2018-09-13T16:11:23.509709+00:00 WARNING cryptohomed[1266]: RemoveKeyset: key to remove not found
2018-09-13T16:11:43.021102+00:00 WARNING kernel: [ 1658.399765] init: ui main process (24839) killed by KILL signal

So if I'm reading this correctly:

a) The test waits a minute for the SessionStateChanged "stopped" D-Bus signal.
b) session_manager gives Chrome's process group a minute to exit.
c) init sends KILL to session_manager 20 seconds after it's been instructed to stop.

The delay for b) looks like it's coming from session_manager_service.cc:

// TODO(mkrebs): Remove CollectChrome timeout and file when
// crosbug.com/5872 is fixed.
// When crash-reporter based crash reporting of Chrome is enabled
// (which should only be during test runs) we use
// kKillTimeoutCollectChrome instead of the kill timeout specified at
// the command line.
const int SessionManagerService::kKillTimeoutCollectChrome = 60;
const char SessionManagerService::kCollectChromeFile[] =
    "/mnt/stateful_partition/etc/collect_chrome_crashes";

...


base::TimeDelta SessionManagerService::GetKillTimeout() {
  if (base::PathExists(base::FilePath(kCollectChromeFile)))
    return base::TimeDelta::FromSeconds(kKillTimeoutCollectChrome);
  else
    return kill_timeout_;
}

---

 https://crbug.com/187808  (the bug mentioned in that TODO) looks like it's been closed since 2013, so maybe that code can be deleted now so that session_manager uses a reasonable delay before killing Chrome?
Owner: derat@chromium.org
Status: Started (was: Untriaged)
Deleting old buggy code is my favorite thing, so I'll take this.
The midis restart failure looks plausibly related: https://chromium.googlesource.com/chromiumos/third_party/autotest/+/master/client/site_tests/cheets_Midis/cheets_Midis.py?pli=1#70

derat@: Can you take a look at this as well? I can imagine removing the timeout would help here too.
I will happily hope that fixing the session_manager timeout will fix other timeout-related failures as well, but if it doesn't, someone who knows something about midis is going to need to investigate that one. :-)
Cc: vapier@chromium.org
Code spelunking: the one-minute delay in session_manager dates back to 2011, in https://crrev.com/c/5289. It was tracked by  https://crbug.com/201001 . I haven't connected all of the dots here yet, but it seems like the one-minute delay was initially meant to be used in cases where crash-reporter is responsible for reporting Chrome crashes.

//src/third_party/autotest/files/client/common_lib/cros/chrome.py still touches collect_chrome_crashes (so that core files will be generated when Chrome crashes, I think). Was the delay added because Chrome needs more time to write those files?

It doesn't seem useful for session_manager to wait a minute for Chrome if session_manager is going to be killed after 20 seconds. Unless anyone tells me not to, I'm going to reduce SessionManagerService::kKillTimeoutCollectChrome to 12 seconds so it'll complete before session_manager is killed. I'm choosing 12 to match kKillTimeoutLongSeconds in session_manager_main.cc, which we use as the timeout on systems with rotating disks (instead of the default, 3 seconds).

I'll reiterate that the core problem is probably that Chrome isn't exiting properly.
when collect_chrome_crashes is active, then yes, coredumps will be collected & saved for Chrome (vs just being read in memory & discarded) which can take sometime to read+write to disk.  the exact mechanics don't matter that much (who exactly is doing the collection) as the files will be large regardless.

if we're seeing early timeouts on bots, then imo that just says the timeouts are too low here.  make them like 5min when testing.  or we turn off the coredump collection ... i'm dubious how useful these actually are to test systems the majority of the time beyond the existing minidump crash reports.
Chrome doesn't appear to have crashed in the logs that I quoted above, so I'm not convinced that these delays should be increased or that disabling core dump collection would help. I think the important thing here is that session_manager kills Chrome before Upstart kills session_manager, and it's easier to tweak the delay on the session_manager side than on the Upstart side.
maybe we could improve things whereby crash-reporter would create a flag file when it's actively collecting a crash, and session manager (in general) could look for that in its shutdown phase.  if it sees the file, it would increase its gracetime significantly.  maybe we create /run/crash_reporter/collection-inprogress/ and session manager can simply check if the dir is missing/empty.
> I'll reiterate that the core problem is probably that Chrome isn't exiting properly.

derat@: Are you working on the underlying issue as well?
Blocking: 883875
Nope, and I don't even know if there's information that could be used to investigate that right now. The Chrome log in the results dir I linked to above just ends with this:

[24902:24902:0913/091123.207861:INFO:remote_commands_invalidator.cc(45)] Shutdown RemoteCommandsInvalidator.
[24902:24902:0913/091123.208018:INFO:remote_commands_invalidator.cc(68)] Stopping RemoteCommandsInvalidator.
[24902:24902:0913/091123.208085:INFO:remote_commands_invalidator.cc(167)] Unregister RemoteCommandsInvalidator.

We might get a Chrome crash that someone can look at after I reduce the timeout, but if so, please point the gardener at it.
Jumping in as on-call today,  crbug.com/883875  (blocked by this bug) is blocking CQ. I've marked it P0. Any updates on this bug?
Project Member

Comment 34 by bugdroid1@chromium.org, Sep 14

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/11f7b9712f6807e31087d10e7da2e29287385f3e

commit 11f7b9712f6807e31087d10e7da2e29287385f3e
Author: Daniel Erat <derat@chromium.org>
Date: Fri Sep 14 19:08:47 2018

login: Reduce collect_chrome_crashes timeout to 12 seconds.

Change the time that session_manager gives Chrome to exit
when /mnt/stateful_partition/etc/collect_chrome_crashes is
set to be 12 seconds rather than a minute.

The ui Upstart job specifies that session_manager should be
sent SIGKILL after 20 seconds, and the longer delay in
session_manager appears to be resulting in some Autotest
tests failing when session_manager is killed before it can
emit a SessionStateChanged "stopped" D-Bus signal while
waiting for a hanging Chrome process.

BUG= chromium:876196 
TEST=none

Change-Id: I1239cd41e381a926f97adcfb088084ef788959ef
Reviewed-on: https://chromium-review.googlesource.com/1224644
Commit-Ready: Dan Erat <derat@chromium.org>
Tested-by: Dan Erat <derat@chromium.org>
Reviewed-by: Jorge Lucangeli Obes <jorgelo@chromium.org>
Reviewed-by: Mike Frysinger <vapier@chromium.org>

[modify] https://crrev.com/11f7b9712f6807e31087d10e7da2e29287385f3e/login_manager/session_manager_service.cc
[modify] https://crrev.com/11f7b9712f6807e31087d10e7da2e29287385f3e/login_manager/session_manager_service.h

Sign in to add a comment