New issue
Advanced search Search tips

Issue 883029 link

Starred by 2 users

Issue metadata

Status: Assigned
Owner:
Cc:
Components:
EstimatedDays: ----
NextAction: ----
OS: Chrome
Pri: 1
Type: Bug

Blocked on:
issue 910411
issue 910409

Blocking:
issue 842272



Sign in to add a comment

Liveness check fails occasionally on chrome restart

Project Member Reported by xiy...@chromium.org, Sep 11

Issue description

Chrome could be restarted when having custom flags (or policy switches). In some case (busy device?), chrome restart is not fast enough and missed a liveness poing. As a result, it is killed when liveness ping timeouts.

e.g.
  http://crash/fa1e98276659a75f
  http://crash/0ef826c8eabcc97a

Typical session manager log:
2018-09-01T10:32:09.342430+02:00 INFO session_manager[27678]: [INFO:browser_job.cc(148)] Running browser /opt/google/chrome/chrome --ppapi-flash-path=/opt/google/chrome/pepper/libpepflashplayer.so --ppapi-flash-...
2018-09-01T10:32:24.360316+02:00 ERR session_manager[27678]: [ERROR:object_proxy.cc(580)] Failed to call method: org.chromium.LivenessServiceInterface.CheckLiveness: object_path= /org/chromium/LivenessService: org.freedesktop.DBus.Error.ServiceUnknown: The name org.chromium.LivenessService was not provided by any .service files
2018-09-01T10:32:39.368552+02:00 WARNING session_manager[27678]: [WARNING:liveness_checker_impl.cc(65)] Browser hang detected!

Note the CheckLiveness call fail. Chrome is not aware of the ping at all and gets killed later on.
 
Cc: steve...@chromium.org mnissler@chromium.org
Components: OS>Systems
Labels: -OS-iOS
I'm not an expert on this code, but it looks like session_manager is already trying to handle this. SessionManagerService::RunBrowser() calls LivenessChecker::Start(), and SessionManagerService::HandleExit() calls LivenessChecker::Stop() (before calling RunBrowser() again).

I think that the logs that you quoted were from a non-stable-channel device since there was a 15s delay between Chrome being started and the D-Bus method call being made. See session_manager_main.cc:

constexpr base::TimeDelta kHangDetectionIntervalStable =
    base::TimeDelta::FromSeconds(60);
constexpr base::TimeDelta kHangDetectionIntervalDev =
    base::TimeDelta::FromSeconds(15);
constexpr base::TimeDelta kHangDetectionIntervalTest =
    base::TimeDelta::FromSeconds(5);

I don't know why we use different values here and wouldn't object to us always using 60s.

But why had Chrome still not owned the org.chromium.LivenessService service name 15s after it was executed? That seems broken.
Not sure what is happening on these devices either. The device seems to be very sluggish.

Using the first crash (http://crash/fa1e98276659a75f) as example. session_manager starts chrome in 2018-09-01T10:32:09.342430. Yet the chrome log file name is chrome_20180901-103232. So 10:32:09 -> 10:32:32, it takes 3s for chrome to create its log file. Did not see that very often.
Project Member

Comment 3 by bugdroid1@chromium.org, Sep 22

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/fefb938a693e9c39c900b38ac4ddd0eedc237093

commit fefb938a693e9c39c900b38ac4ddd0eedc237093
Author: Xiyuan Xia <xiyuan@google.com>
Date: Sat Sep 22 05:25:52 2018

login: Dump top tasks on browser hang

Liveness check ping fails more frequently than before. Crash
dump shows no trace of chrome busy or deadlock. However, the
device seems to be busy when the ping fails according to the
log line timestamp (e.g. 10s for session_manager start). Dump
top tasks to get more data of the situation.

Also remove lsof debug log since it is no longer needed.

BUG=chromium:883029
TEST=Manual.
Change-Id: I7302fbf0fbd68e49a30df297f90fc72cceeb5476
Reviewed-on: https://chromium-review.googlesource.com/1238939
Commit-Ready: Xiyuan Xia <xiyuan@chromium.org>
Tested-by: Xiyuan Xia <xiyuan@chromium.org>
Reviewed-by: Xiyuan Xia <xiyuan@chromium.org>

[modify] https://crrev.com/fefb938a693e9c39c900b38ac4ddd0eedc237093/login_manager/liveness_checker_impl.cc

Quick look at crashes for 71.0.3558.0 and the top 20 clients. 

The 1st one is busy flashrom, probably issue 865731.
The 2nd one is chrome. The crash stack looks not that busy though.
The 3rd is cras
The 4th is OOM, kswapd

livenss_crash_top_process.zip
15.9 KB Download
Blockedon: 865731
Project Member

Comment 6 by bugdroid1@chromium.org, Oct 2

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/6c6d2f3138d061739ddd83a774818f3ec0be238d

commit 6c6d2f3138d061739ddd83a774818f3ec0be238d
Author: Xiyuan Xia <xiyuan@google.com>
Date: Tue Oct 02 12:19:59 2018

login: Collect command line of top tasks on browser hang

- Collect top task command line (up to 512 chars);
- Trim by lines instead of string size;

BUG=chromium:883029
TEST=Manual.

Change-Id: Ifadf5c30ff62b6f645e6f4ed852be2e30c4efa8c
Reviewed-on: https://chromium-review.googlesource.com/1252548
Commit-Ready: Xiyuan Xia <xiyuan@chromium.org>
Tested-by: Xiyuan Xia <xiyuan@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/6c6d2f3138d061739ddd83a774818f3ec0be238d/login_manager/liveness_checker_impl.cc

Any update here? The associated signature has creeped up to be the #1 crash in the latest beta.
Looks like we got another spike on the flashrom issue. Looked the the top 3 clients in the list:

https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AMessagePumpDefault%3A%3ARun%27

Will ping the flashrom bug with the new data.


Cc: abodenha@chromium.org
This bug was originally filed to handle an issue where Chrome could apparently get a SIGABRT from session_manager due to Chrome taking so long to restart (e.g. for custom flags) that session_manager thinks that Chrome is hanging.

Please be sure to not conflate that concern with all crashes where Chrome received a SIGABRT from session_manager. All that these SIGABRTs tell us is that Chrome was hanging. More investigation needs to happen to determine the cause (typically in the form of inspection of the main thread of the browser process).

If http://crash/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+expanded_custom_data.ChromeCrashProto.magic_signature_1.name%3D%27base%3A%3AMessagePumpDefault%3A%3ARun%27 (from #8) is what's being looked at here, I don't think that those crashes are meaningful. As I click through the sample reports, I'm just seeing various processes receiving SIGABRT while idle. Please only look at crash reports from the browser process here.
The top 3 clients in the crash link are mostly (if not all) killed due to liveness check and seems because of flashrom per chrome.txt. We can jump to relevant browser process crash by client id, 

e.g. for the top 1 client e59189053fa244b2a6c76a149f21cc31 :

https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+clientid%3D%27e59189053fa244b2a6c76a149f21cc31%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27
Should we file a separate issue for the flashrom problem? Any idea what's triggering the flashrom call that leads to the failure?
Blockedon: 910409
Blockedon: 910411
Blockedon: -865731
Yes, filed  issue 910409  for flashrom issue (was using issue 865731 but vbendeb@ says it is likely a different cause, hence the new bug).

Also filed  issue 910409  that wallpaper resizer also hogs CPU, the 2nd popular in the crashes I saw.


Figured out a crash query to show crashes caused by slow chrome start as in #0. Basically look at browser crashes with ProcessUptime < 30000.

Here is the one for M72:
https://crash.corp.google.com/browse?q=product_name%3D%27Chrome_ChromeOS%27+AND+product.Version+LIKE+%2772.%25%27+AND+expanded_custom_data.ChromeCrashProto.ptype%3D%27browser%27+AND+ProcessUptime+%3C+30000+AND+crash.Reason%3D%27SIGABRT%27#samplereports

Master issue 842272 has about 7506 crashes today and the query above has 1291 crashes. 7506 could roughly be converted to relevant browser crash by divide 3 (renderer/extension/utility) to 2500+. So this is about 50%.

Unfotuntely, I have not figured out a good way to further break it down to flashrom related, OOM/slash related, hang on chrome start (DRM/GPU related) but those are the noted causes so far.

We could also increase the ProcessUptime to 60000 and see liveness check kills but it is not due to slow chrome start when 30000 <= ProcessUptime < 60000. flashrom seems popular there too.
Project Member

Comment 17 by bugdroid1@chromium.org, Dec 7

The following revision refers to this bug:
  https://chromium.googlesource.com/chromiumos/platform2/+/f8ab5136f85ea9251b37a0ba4f59e1f42d4a2748

commit f8ab5136f85ea9251b37a0ba4f59e1f42d4a2748
Author: Xiyuan Xia <xiyuan@google.com>
Date: Fri Dec 07 06:06:43 2018

login: Collect pstree of flashrom

If flashrom shows up on the top output, dump
  pstree -sal -p <flashrom_pid>
output to log.

BUG=chromium:883029,chromium:910411
TEST=Manual

Change-Id: I3ed8084763feea188b6b6b19ee64ee834ab22055
Reviewed-on: https://chromium-review.googlesource.com/1362396
Commit-Ready: Xiyuan Xia <xiyuan@chromium.org>
Tested-by: Xiyuan Xia <xiyuan@chromium.org>
Reviewed-by: Dan Erat <derat@chromium.org>

[modify] https://crrev.com/f8ab5136f85ea9251b37a0ba4f59e1f42d4a2748/login_manager/liveness_checker_impl.cc

Sign in to add a comment